Lecture Notes in Artificial Intelligence Edited by R. Goebel, J. Siekmann, and W. Wahlster
Subseries of Lecture Notes in Computer Science
5897
Yang Cai (Ed.)
Computing with Instinct Rediscovering Artificial Intelligence
Series Editors
Randy Goebel, University of Alberta, Edmonton, Canada
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany

Volume Editor
Yang Cai
Carnegie Mellon University
CYLAB - Instinctive Computing Lab
CIC-2218, 4720 Forbes Avenue, Pittsburgh, PA 15213, USA
E-mail:
[email protected]
ISSN 0302-9743 e-ISSN 1611-3349 ISBN 978-3-642-19756-7 e-ISBN 978-3-642-19757-4 DOI 10.1007/978-3-642-19757-4 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2011922330 CR Subject Classification (1998): H.5, I.2, F.1.1, I.6, K.4 LNCS Sublibrary: SL 7 – Artificial Intelligence
© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Cover Photo. The autonomous vehicle was tested at Carnegie Mellon University, Pittsburgh campus, where the Instinctive Computing Workshop was held on June 9–10, 2009. The vehicle is designed to sense roads and avoid collisions instinctually.
Preface
Simplicity in nature is the ultimate sophistication. Honey bees are not able to play chess or solve the Tower of Hanoi puzzle; however, they do know how to build, defend, forage, navigate, and communicate for survival. They can even learn to recognize human letters independent of size, color, position, or font. Instinct is an inherited behavior that responds to particular environmental stimuli. In his book On the Origin of Species, Darwin pointed out that no complex instinct can possibly be produced through natural selection, except by the slow and gradual accumulation of numerous, slight, yet profitable, variations. Darwin also concluded that no one would dispute that instincts are of the highest importance to every animal species. The world’s magnificence has been enriched by the inner drive of instincts, perhaps the most profound drive of our everyday life.

Instinctive Computing is a computational simulation of biological and cognitive instincts, which influence how we see, feel, appear, think, and act. If we want a computer to be genuinely secure, intelligent, and to interact naturally with us, we must give computers the ability to recognize, understand, and even to have primitive instincts. We aim to understand the instinctive common sense of living creatures, including the specialties of individual species as well. Instinctual systems will learn from insects, marine life, animals, and children, to broaden and develop their essential primitive thinking. “Computing with instincts” must be conceived as a meta-program, not a violent attack on traditional artificial intelligence. Instead, this is an aggressive leap for a robust, earthy, and natural intelligence.

In the summer of 2009, the first Instinctive Computing Workshop (ICW 2009) was hosted at Carnegie Mellon University, Pittsburgh, USA, jointly sponsored by the National Science Foundation, Cylab, and Google. The two-day workshop aimed to explore the transformational developments in this area, including the building blocks for instinctive computing systems and potential applications in fields such as security, privacy, human–computer interaction, next-generation networks, and product design. The workshop was organized to engage in in-depth dialogue in a small group with multidisciplinary minds, returning to the origin of workshops to focus on ideas.

This book, Computing with Instinct, comprises the proceedings of ICW 2009. It is the first state-of-the-art book on this subject. This book consists of three parts: Instinctive Sensing, Communication, and Environments.

Part I. Instinctive Sensing. For many years, cyborg pioneer Warwick has explored neural behavior with bi-directional interactions between the brain and implanted devices, which he calls “Implantable Computing.” In this book, Warwick and his colleagues present their new experiments with culturing biological neurons in vitro for the control of mobile robots. Inherent operating characteristics of the cultured neural network have been trained to enable the physical
robot body to respond to environmental stimuli such as collisions. The 100,000 biological neurons are grown and trained to act as the brain of an interactive real-world robot – thereby acting as hybrid instinctive computing elements. Studying such a system provides insights into the operation of biological neural structures; therefore, such research has immediate medical implications as well as enormous potential in computing and robotics. This keynote chapter provides an overview of the problem area, gives an idea of the breadth of present ongoing research, details the system architecture and, in particular, reports on the results of experiments with real-life robots. Warwick envisioned this as a new form of artificial intelligence.

Sound recognition is an invaluable primitive instinct for mammals. A recent archeological discovery suggested that, for over 120 million years, animals have developed an elaborate auditory system for survival. In the modern era, it is the most affordable diagnostic sensory channel for us, ranging from watermelon selection and car diagnosis to using a medical stethoscope. Cai and Pados explore an auditory vigilance algorithm for detecting background sounds such as explosion, gunshot, screaming, and human voices. They introduce a general algorithm for sound feature extraction, classification, and feedback. It is concluded that the new algorithm reaches a higher accuracy with available training data. This technology has potential in many broader applications of the sound recognition method, including video triage, healthcare, robotics, and security.

About half of our brain cells are devoted to visual cognition. A texture provides instinctual cues about the nature of the material, the border, and the distance. The visual perception of texture is key to interpreting object surfaces. In Vernhes and Whitmore’s study, images of textured surfaces of prototype art objects are analyzed in order to identify the methods and the metrics that can accurately characterize slight changes in texture. Three main applications are illustrated: the effect of the conditions of illumination on perceived texture, the characterization of changes of objects due to degradation, and the quantification of the efficiency of the restoration.

Part II. Instinctive Communication. Visual abstraction enables us to survive in complex visual environments, augmenting critical features with minimal elements – words. Cai et al. explore the culture and esthetic impacts on visual abstraction. Based on everyday life experience and lab experiments, they found that the factors of culture, attention, purpose, and esthetics help reduce the visual communication workload to a minimum. These studies involve exploration into multi-resolution, symbol-number, semantic differentiation, analogical and cultural emblematization aspects of facial features.

To learn a genre is to learn the instinctual and cultural situations that support it. This dominant thinking overlooks critical aspects of genre that appear to be based in deep clusters within natural language lexicons that seem instinctual and cross-cultural. Hu et al. present a theory of lexical clusters associated with critical communication instincts. They then show how these instincts aggregate to support a substrate of conventional English writing genres. To test the cross-cultural validity of these clusters, they tested Chinese students in rural China
with limited training in native English writing and limited exposure to native English cultural situations.

Non-verbal communication such as gestures and facial expressions is a major part of fundamental interaction among people. Sonntag views intuition as instinctive dialogue. To allow for intuitive communication, multimodal task-based dialogue must be employed. A concrete environment, where an intuition model extends a sensory-based modeling of instincts, can be used to assess the significance of intuition in multimodal dialogue.

Part III. Instinctive Environments. Rapidly growing virtual world technologies permit collaboration in a distributed, virtual environment. In a real-world environment, distributed teams collaborate via face-to-face communication using social interactions, such as eye contact and gestures, which provide critical information and feedback to the human decision maker. The virtual environment presents unique challenges in this regard. Masakowski and Aguiar focus on how we evaluate human performance and various levels of expertise, strategies, and cognitive processes of decision makers within the virtual environment. Their explorations include accurate and time-critical information flow, cognitive workload, and situational awareness among team members.

We are not living in the forest anymore. Modern living environments enable us to maximize comfort zones; however, they also introduce new problems associated with those artifacts. García-Herranz et al. study how to enable end-users to manage their preferences in personal environments. The system uses rules and modularizing agents, paying special attention to end-user programming issues and the natural hierarchies present in the environment. Furthermore, O’Grady et al. propose an intelligent middleware framework as a means for harnessing the disparate data sources necessary for capturing and interpreting implicit interaction events.

The manifesto for ubiquitous computing was released in the early 1990s. Ten years later, ambient intelligence was envisioned. Today, how to implement networked intelligent artifacts remains an open issue. Human–computer interaction tries to combine psychology, computing, and design into a science. However, prevailing usability-centric studies have had little impact on real-world products or interactions. We need new genes, new dimensions, and new approaches. The goal of this book is to rethink the origin of human interactions, to define instinctual components, and to demonstrate the potential of such a new computing paradigm. We believe that “computing with instinct” is a solution for fundamental problems in ambient intelligence, such as situation awareness, understanding, learning, and simplicity.

On behalf of the workshop committee and editing crew, I would like to thank all of the authors for their support for the book. Many thanks to Sylvia Spengler of the National Science Foundation, Pradeep Khosla, Adrian Perrig, Virgil Gligor, Howard Lipson, Richard Noland, Kristopher Rush, William Eddy, David Kaufer, Mel Siegel, and Richard Stafford of Carnegie Mellon University for their support. The Instinctive Computing Workshop was generously supported by the National Science Foundation, Google, and Cylab of Carnegie Mellon University.
The related projects have been sponsored in part by the US Army Research Office, the Computer Emergency Response Team (CERT), and the Air Force Research Laboratory in Rome, NY. However, the concepts in this book do not necessarily reflect the policies or opinions of any governmental agencies.

Yang Cai
Organization
Organizers

Yang Cai          Carnegie Mellon University
Sylvia Spengler   National Science Foundation, USA
Howard Lipson     CERT, Carnegie Mellon University
Program Committee

Julio Abascal        University of the Basque Country, Spain
Xavier Alamán        Autonomous University of Madrid, Spain
José Bravo           Universidad de Castilla-La Mancha, Spain
Andrew Cowell        Pacific Northwest National Laboratory, USA
David Farber         Carnegie Mellon University, USA
Virgil Gligor        Carnegie Mellon University, USA
Fabian Hemmert       Deutsche Telekom Labs, Germany
Michael Leyton       Rutgers University, USA
Xiaoming Liu         GE Research Center, USA
Yvonne Masakowski    NAVY, USA
Adrian Perrig        Carnegie Mellon University, USA
Mel Siegel           Carnegie Mellon University, USA
Brenda Wiederhold    Interactive Media Institute, Belgium
Mark Wiederhold      Virtual Reality Medical Center, USA
Brian Zeleznik       Carnegie Mellon University, USA
Editor Yang Cai
Editing and Design Assistant Emily Durbin
Coordinator Samantha Stevick
Table of Contents
Part I: Instinctive Sensing

Experiments with an In-Vitro Robot Brain .......................... 1
   Kevin Warwick, Slawomir J. Nasuto, Victor M. Becerra, and Benjamin J. Whalley

Sound Recognition ................................................. 16
   Yang Cai and Károly D. Pados

Texture Vision: A View from Art Conservation ...................... 35
   Pierre Vernhes and Paul Whitmore

Part II: Instinctive Communication

Visual Abstraction with Culture ................................... 47
   Yang Cai, David Kaufer, Emily Hart, and Yongmei Hu

Genre and Instinct ................................................ 58
   Yongmei Hu, David Kaufer, and Suguru Ishizaki

Intuition as Instinctive Dialogue ................................. 82
   Daniel Sonntag

Part III: Instinctive Environments

Human Performance in Virtual Environments ......................... 107
   Yvonne R. Masakowski and Steven K. Aguiar

Exploitational Interaction ........................................ 119
   Manuel García-Herranz, Xavier Alamán, and Pablo A. Haya

A Middleware for Implicit Interaction ............................. 143
   M.J. O’Grady, J. Ye, G.M.P. O’Hare, S. Dobson, R. Tynan, R. Collier, and C. Muldoon

Author Index ...................................................... 163
Experiments with an In-Vitro Robot Brain

Kevin Warwick¹, Slawomir J. Nasuto¹, Victor M. Becerra¹, and Benjamin J. Whalley²

¹ School of Systems Engineering, University of Reading, UK
² School of Chemistry, Food Biosciences and Pharmacy, University of Reading, UK
{K.Warwick,S.J.Nasuto,V.M.Becerra,B.J.Whalley}@reading.ac.uk
Abstract. The controlling mechanism of a typical mobile robot is usually a computer system either remotely positioned or in-body. Recent research is on-going in which biological neurons are grown and trained to act as the brain of an interactive real-world robot – thereby acting as instinctive computing elements. Studying such a system provides insights into the operation of biological neural structures; therefore, such research has immediate medical implications as well as enormous potential in computing and robotics. A system involving closed-loop control of a mobile robot by a culture of neurons has been created. This article provides an overview of the problem area, gives an idea of the breadth of present ongoing research, details our own system architecture and, in particular, reports on the results of experiments with real-life robots. The authors see this as a new form of artificial intelligence.
1 Introduction

In the last few years, considerable progress has been made towards hybrid systems in which biological neurons are integrated with electronic components. As an example, Reger [1] demonstrated the use of a lamprey brain to control a small wheeled robot's movements; meanwhile, others were successfully able to send control commands to the nervous system of cockroaches [2] or rats [3] as if they were robots. These studies can inform us about information processing and encoding in the brains of living animals [4]. However, they do pose ethical questions and can be technically problematic since access to the brain is limited by barriers such as the skin and skull, and data interpretation is complicated due to the sheer number of neurons present in the brain of even the simplest animal. Coupled with this, approaches which involve recording the activity of individual neurons or small populations of neurons are limited by their invasive, and hence destructive, nature. As a result, neurons cultured under laboratory conditions on a planar array of non-invasive electrodes provide an attractive alternative with which to probe the operation of biological neuronal networks.

Understanding neural behaviour is certainly extremely important in establishing better bi-directional interactions between the brain and external devices. On top of this, for neurological disorders, establishing improved knowledge about the fundamental basis of the inherent neuronal activity is critical. A robot body can potentially move around a defined area and the effects within a biological brain, which is
controlling the body, can be witnessed. This opens up the possibility of gaining a fundamental appreciation and understanding of the cellular correlates of memory and resultant actions based on learning and habit.

Research has recently been focused on culturing networks of some tens of thousands of brain cells grown in vitro [5]. These cultures are created by enzymatically dissociating neurons obtained from foetal rodent cortical tissue and then culturing them in a specialised chamber, in doing so providing suitable environmental conditions and nutrients. An array of electrodes is embedded in the base of the chamber (a Multi Electrode Array; MEA), providing an electrical interface to the neuronal culture [6-9]. The neurons in such cultures begin to spontaneously branch out and, within an hour of placement, even without external stimulation, they begin to re-connect with other nearby neurons and commence electrochemical communication. This propensity to spontaneously connect and communicate demonstrates an innate tendency to network. Studies of neural cultures demonstrate distinct periods of development defined by changes in activity which appear to stabilise after 30 days and, in terms of useful responses, last for at least 2-3 months [10, 11]. The cultures of neurons form a monolayer on the MEA, making them both amenable to optical microscopy and accessible to physical and chemical manipulation [9].

The specific aim of the ongoing project described here is to investigate the use of cultured neurons for the control of mobile robots. However, in order to produce useful processing, we postulate that disembodied biological networks must develop in the presence of meaningful input/output relationships as part of closed-loop sensory interaction with the environment. This is evidenced by animal and human studies which show that development in a sensory-deprived environment results in poor or dysfunctional neural circuitry [13, 14]. To this end, the overall closed-loop hybrid system involving a primary cortical culture on an MEA and a mobile robot body must exist within a sufficiently rich and reasonably consistent environment. This then constitutes an interesting and novel approach to examining the computational capabilities of biological networks [15].

Typically, in vitro neuronal cultures consist of thousands of neurons generating highly variable, multi-dimensional signals. In order to extract components and features representative of the network's overall state from such data, appropriate preprocessing and dimensionality reduction techniques must be applied. Several schemes have been constructed to date. Shkolnik created a control scheme for a simulated robot body [16] in which two channels of an MEA were selected and an electrical stimulus consisting of a +/-600 mV, 400 μs biphasic pulse was delivered at varying inter-stimulus intervals. Information coding was based on testing the effect of electrically induced neuronal excitation with a given time delay, termed the Inter-Probe Interval (IPI), between two stimulus probes. This technique gave rise to a characteristic response curve which formed the basis for deciding the robot's direction of movement using basic commands (forward, backward, left and right). In one experiment, a simulated rat [32] served as the embodiment, moving inside a four-walled environment that included barrier objects.
Meanwhile, physical robots were used in an experiment [16] wherein one of the robots was able to maintain a constant distance from a second robot, which was moving under pseudo-random control. It was reported that the first robot managed to successfully approach the second and maintain a fixed distance from it. Information on the spontaneous activity of the
culture was sent to a computer which then made the binary decisions as to what action the robot should take. The culture itself was not directly controlling the Koala robot through a feedback loop, and no learning effect was reportedly exploited. In contrast with these experiments, both closed-loop control and learning are central aims in our own study.

DeMarse and Dockendorf investigated the computational capacity of cultured networks by implementing the control of a “real-life” problem, namely controlling a simulated aircraft's flight path (e.g. altitude and roll adjustments) [17]. Meanwhile, Shahaf and Marom [18] reported one of the first experiments to achieve desired discrete output computations by applying a simple form of supervised learning to disembodied neuronal cultures. Recently, Bull & Uroukov [19] applied a Learning Classifier System to manipulate culture activity towards a goal level using simple input signals. In both of these latter experiments, the desired result was only achieved in about one third of the cases, indicating some of the difficulties in achieving repeatability. But this is a field of study very much in its infancy. There are bound to be difficulties; however, there is much to be learnt.

It is apparent that, even at such an early stage, such re-embodiments (real or virtual) have an important role to play in the study of biological learning mechanisms and neurological behaviour in general. Our physically embodied robots provide the starting point for creating a proof-of-concept control loop around the neuronal culture and a basic platform for future – more specific – reinforcement learning experiments. The fundamental problem is the coupling of the robot's goals to the culture's input-output mapping. The design of the robot's architecture discussed in this paper therefore emphasises the need for flexibility and the use of machine learning techniques in the search for such a coupling.

In the section which follows, the general procedure for laying out the neural culture (the biological component) is described. This is followed by a description of the main elements of the closed-loop control system, including the culture as an important element in the feedback loop. Details of the current system's architecture are given in Section 3. Section 4 includes a description of our initial tests and preliminary results. Section 5 meanwhile provides an explanation of the Machine Learning (ML) context, and Section 6 concludes with an overview of current progress. Finally, Section 7 discusses new, ongoing research and planned future extensions.
2 Culture Preparation

To realise the cultured neural network, cortical tissue is dissected from the brains of embryonic rats and neuronal cells enzymatically dissociated before seeding onto planar Multi Electrode Arrays (MEAs). The cells are restricted to lie within the recording horizon of the electrode array by means of a template placed on the MEA prior to seeding and removed immediately after cells have settled (~ 1 hour). The MEA is also filled with a conventional cell culture medium containing nutrients, growth hormones and antibiotics, of which 50% is replaced twice weekly. Within the first hour after seeding, neurons appear to extend connections to nearby cells (even within the first few minutes this has been observed) and, within 24 hours, a thick mat of neuronal extensions is visible across the seeded area.
The connectivity between seeded cells increases rapidly over subsequent days. After 7 days, electrical signals are observed in the form of action potentials which, in the ‘disembodied culture’ (not connected within the closed loop), transform into dense bursts of simultaneous electrical activity across the entire network over the following week. This bursting feature subsequently continues through to maturity (30 days in vitro and onwards). It is not well understood what the bursting actually means and how much it is part of normal neural development. However, such continued behaviour, after this initial development phase, may subsequently be representative of an underlying pathological state resulting from impoverished sensory input and may differ from the activity of a culture developing within a closed loop [20]. This is something which remains to be studied further. Cultures usually remain active until approximately 3 months of age. During this time, they are sealed with Potter rings [21] to maintain sterility and osmolarity and are maintained in a humidified, 37 °C, 5% CO2 incubator. Recordings are undertaken in a non-humidified 37 °C, 5% CO2 incubator for between 30 minutes and 8 hours, dependent on environmental humidity and the resulting stability of activity.
3 Experimental Arrangements

The multi-electrode array enables voltage fluctuations in the culture (relative to a reference ground electrode outside the network) to be recorded in real-time at 59 sites out of 64 in an ‘8x8’ array (Figure 1). This allows for the detection of neuronal action potentials within a 100 µm radius (or more) around an individual electrode. By using spike-sorting algorithms [12], it is then possible to separate the firings of multiple individual neurons, or small groups of neurons, as monitored on a single electrode. As a result, multi-electrode recordings across the culture permit a picture of the global activity of the entire neuronal network to be formed. It is possible to electrically stimulate via any of the electrodes to induce focused neural activity. The multi-electrode array therefore forms a functional and non-destructive bi-directional interface to the cultured neurons.

Electrically-evoked responses and spontaneous activity in the culture (the neuronal network) are coupled to the robot architecture, and hence on to the physical robot via a machine-learning interface, which maps the features of interest to specific actuator commands. Sensory data fed back from the robot is associated with a set of appropriate stimulation protocols and is subsequently delivered to the culture, thereby closing the robot-culture loop. Thus, signal processing can be broken down into two discrete sections: (a) ‘culture to robot’, in which an output machine learning procedure processes live neuronal activity, and (b) ‘robot to culture’, which involves an input mapping process, from robot sensor to stimulus.

It is important to realise that the overall system employed in this experiment has been designed based on a closed-loop, modular architecture. As neuronal networks exhibit spatiotemporal patterns with millisecond precision [22], processing of these signals necessitates a very rapid response from neurophysiological recording and robot control systems. The software developed for this project runs on Linux-based workstations that communicate over the Ethernet via fast server-client modules, thus providing the necessary speed and flexibility required.
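The two signal-processing directions described above can be made concrete with a minimal sketch. The following Python fragment is illustrative only: the electrode indices, the turn rule and the stimulation parameters are hypothetical placeholders standing in for the machine-learning mappings used in the actual system.

```python
import numpy as np

# Hypothetical closed-loop skeleton: "culture to robot" maps spike features to
# an actuator command, "robot to culture" maps a sensor reading to a stimulus.
# All constants below are invented for illustration.

def culture_to_robot(spike_counts):
    """Map per-electrode spike counts (one time window) to a motor command."""
    OUTPUT_ELECTRODE = 42                  # hypothetical index
    if spike_counts[OUTPUT_ELECTRODE] > 0:
        return "TURN_RIGHT"
    return "FORWARD"

def robot_to_culture(sonar_cm):
    """Map a sonar reading (cm) to a stimulation request for the input electrode."""
    INPUT_ELECTRODE = 17                   # hypothetical index
    THRESHOLD_CM = 30.0                    # wall-detection threshold from the text
    if sonar_cm < THRESHOLD_CM:
        return {"electrode": INPUT_ELECTRODE, "pulse_mV": 600, "biphasic": True}
    return None                            # no stimulation needed this cycle

# One loop iteration with fake data: 59 active recording sites.
window = np.zeros(59, dtype=int)
window[42] = 3                             # pretend the output electrode fired
print(culture_to_robot(window), robot_to_culture(sonar_cm=25.0))
```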
In recent years, the study of neuronal cultures has been greatly facilitated by commercially available planar MEA systems. These consist of a glass specimen chamber lined with an 8x8 array of electrodes, as shown in Figure 1. One such MEA is employed in our overall robot system.
Fig. 1. a) An MC200/30iR-gr MEA (NMI, Reutlingen, Germany), showing the 30 µm electrodes which lead to the electrode column–row arrangement b) Electrode arrays in the centre of the MEA seen under an optical microscope (Nikon TMS, Nikon, Japan), ×4 magnification c) An MEA at ×40 magnification, showing neuronal cells in close proximity to an electrode, with visible extensions and inter-connections
A standard MEA (Figure 1a) measures 49 mm x 49 mm x 1 mm and its electrodes provide a bidirectional link between the culture and the rest of the system. The associated data acquisition hardware includes a head-stage (MEA connecting interface), 60 channel amplifier (1200x gain; 10-3200 Hz bandpass filter), stimulus generator and PC data acquisition card. To this point, we have successfully created a modular closed-loop system between a (physical) mobile robotic platform and a cultured neuronal network using a Multi-Electrode Array, allowing for bidirectional communication between the culture and the robot. It is estimated that the cultures employed in our studies consist of approximately (on average) 100,000 neurons. The actual number in any one specific culture depends on natural density variations in proliferation post-seeding and experimental aim. The spontaneous electrochemical activity of the culture – realised as signals at certain of the electrodes – is used as input to the robot's actuators, and the robot's (ultrasonic) sensor readings are (proportionally) converted into stimulation signals received by the culture, effectively closing the loop. We are using a versatile, commercially available, Miabot robot (Figure 2) as our physical platform. This exhibits accurate motor encoder precision (~0.5 mm) and has a maximum speed of approximately 3.5 m/s. Hence it can move around quite quickly in real time. Recording and stimulation hardware is controlled via open-source
MEABench software [23]. However, we have also developed our own custom stimulator control software, which interfaces with the commercially available stimulation hardware with no need for hardware modification [23]. The overall closed-loop system therefore consists of several modules, including the Miabot robot, an MEA and stimulating hardware, a directly linked workstation for conducting computationally expensive neuronal data analyses, a separate machine running the robot control interface, and a network manager routing signals directly between the culture and the robot body. The various components of the architecture communicate via TCP/IP sockets, allowing for the distribution of processing loads to multiple machines throughout the University of Reading’s internal network. The modular approach to the problem is shown in more detail in Figure 3. The Miabot is wirelessly controlled via Bluetooth. Communication and control are performed through custom C++ server code and TCP/IP sockets and clients running on the acquisition PC which has direct control of the MEA recording and stimulating software. The server sends motor commands and receives sensory data via a virtual serial port over the Bluetooth connection, while the client programs contain the closed loop code which communicates with and stimulates the MEA culture. The client code also performs text logging of all important data during an experiment run.
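The actual servers are written in custom C++, as noted above; purely to illustrate the modular, socket-based layout, the sketch below shows a minimal Python relay that accepts newline-terminated motor commands over TCP. The port number and command format are assumptions, and the write to the robot's Bluetooth serial port is replaced with a print so that the sketch stays self-contained.

```python
import socket

HOST, PORT = "0.0.0.0", 5005   # hypothetical port for the robot-control module

def serve_motor_commands():
    """Accept newline-terminated motor commands over TCP and relay them.

    In the real system the relay writes to a Bluetooth virtual serial port;
    here the relay step is a print so the example runs anywhere.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen(1)
        conn, addr = srv.accept()
        with conn, conn.makefile("r") as stream:
            for line in stream:                 # e.g. "LEFT 0.10 RIGHT 0.20\n"
                command = line.strip()
                if command == "QUIT":
                    break
                print("-> robot serial port:", command)   # placeholder relay

if __name__ == "__main__":
    serve_motor_commands()
```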
Fig. 2. The Miabot robot with a cultured neural network
Fig. 3. Modular layout of the robot/MEA system
This modular approach to the architecture has resulted in a system with easily reconfigurable components. The obtained closed-loop system can efficiently handle the information-rich data that is streamed via the recording software. A typical sampling frequency of 25 kHz of the culture activity demands large network, processing and storage resources. Consequently, on-the-fly streaming of spike-detected data is the preferred method when investigating real-time closed-loop learning techniques.
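As a rough illustration of what on-the-fly spike detection on one streamed block of single-electrode data might look like, the sketch below applies a simple negative-threshold rule at the stated 25 kHz sampling rate. The real pipeline relies on MEABench and spike-sorting; the threshold convention and refractory period used here are common defaults, not the project's exact settings.

```python
import numpy as np

FS = 25_000  # sampling frequency in Hz, as stated in the text

def detect_spikes(block, k=5.0, refractory_ms=1.0):
    """Return spike times (s) for one single-electrode block of samples.

    Uses a negative threshold at k times a robust noise estimate; treat this
    as a toy stand-in for the MEABench/spike-sorting stage.
    """
    noise = np.median(np.abs(block)) / 0.6745          # robust sigma estimate
    threshold = -k * noise
    crossings = np.flatnonzero((block[1:] < threshold) & (block[:-1] >= threshold))
    refractory = int(refractory_ms * 1e-3 * FS)
    spikes, last = [], -np.inf
    for idx in crossings:
        if idx - last >= refractory:                   # ignore re-triggers
            spikes.append(idx / FS)
            last = idx
    return np.array(spikes)

# Demo on synthetic data: 1 s of noise plus three injected negative deflections.
rng = np.random.default_rng(0)
block = rng.normal(0, 10e-6, FS)                       # ~10 µV background noise
for t in (0.2, 0.5, 0.8):
    block[int(t * FS)] -= 120e-6                       # fake 120 µV spikes
print(detect_spikes(block))
```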
4 Experimental Results

Firstly, an existing appropriate neuronal pathway was identified by searching for strong input/output relationships between pairs of electrodes. Suitable input/output pairs were defined as those electrode combinations in which neurons proximal to one electrode responded to stimulation of the other electrode (at least one action potential within 100 ms of stimulation) more than 60% of the time, and responded no more than 20% of the time to stimulation on any other electrode. An input-output response map was then created by cycling through all preselected electrodes individually with a positive-first biphasic stimulating waveform (600 mV; 100 µs each phase, repeated 16 times). By averaging over 16 attempts, it was ensured that the majority of stimulation events fell outside any inherent culture bursting that might have occurred. In this way, a suitable input/output pair could be chosen, dependent on how the cultures had developed, in order to provide an initial decision-making pathway for the robot.
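The pair-selection rule just described (a response within 100 ms on more than 60% of the 16 stimulation repeats, and no more than 20% for any other stimulation electrode) can be expressed directly, assuming the response probabilities have already been measured. The matrix below is fabricated for illustration.

```python
import numpy as np

def select_io_pairs(resp_prob, min_pair=0.6, max_other=0.2):
    """Find candidate input/output electrode pairs.

    resp_prob[i, j] = fraction of stimulation repeats on electrode i that
    evoked at least one spike on electrode j within 100 ms.  The matrix would
    come from the stimulation sweep; the values used below are invented.
    """
    n = resp_prob.shape[0]
    pairs = []
    for i in range(n):                # stimulation (input) electrode
        for j in range(n):            # recording (output) electrode
            if i == j:
                continue
            others = np.delete(resp_prob[:, j], i)   # responses to other inputs
            if resp_prob[i, j] > min_pair and np.all(others <= max_other):
                pairs.append((i, j))
    return pairs

# Fabricated 5x5 response-probability matrix with one strong pathway (1 -> 3).
resp = np.full((5, 5), 0.1)
resp[1, 3] = 0.8
print(select_io_pairs(resp))          # -> [(1, 3)]
```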
To be clear about this initialisation process: In the initially developed culture, we found, by experimentation, a reasonably repeatable pathway in the culture from stimulation to response. We then employed this to control the robot body as we saw fit – for example, if the ultrasonic sensor was active, then we wished the response to cause the robot to turn away from the ultrasonically located object in order to keep moving without bumping into anything. In the set-up, the robot followed a forward path within its corral confines until it reached a wall, at which point the front sonar value decreased below a threshold (set at approximately 30 cm), triggering a stimulating pulse as shown in Figure 4. If the responding/output electrode registered activity following the input pulse, then the robot turned to avoid the wall. Essentially, activity on the responding electrode was interpreted as a command for the robot to turn in order to avoid the wall. It was apparent that, in fact, the robot turned spontaneously whenever activity was registered on the response/output electrode. The most relevant result for the experiment was the occurrence of the chain of events: wall detection–stimulation–response. From a philosophical and neurological perspective, it is of course also of interest to speculate why there was activity on the response electrode when no stimulating pulse had been applied.

The typical behaviour in the cultures studied was generally a period of inactivity (or low-frequency activity) prior to stimulus, followed by heightened network activity induced almost immediately (within a few milliseconds) after stimulus, which decayed (typically after ~100 ms) to baseline pre-stimulus activity. The study opens up the possibility of investigating response times of different cultures under different conditions and how they might be affected by external influences such as electrical fields and pharmacological stimulants [24]. At any one time, we typically have 25 different cultures available, hence such comparative developmental studies are now being conducted.

With the sonar threshold set at approximately 30 cm from a wall, a stimulation pulse was applied to the culture, via its sensory input, each time this threshold was breached – effectively, when the robot's position was sufficiently close to a wall. An indication of the robot's typical activity during a simple wall-detection/right-turn experiment is shown in Figure 4. The green trace indicates the front sonar value. Yellow bars indicate stimulus pulse times and blue/red bars indicate sonar timing/actuator command timing. As can be witnessed, these response events (single detected spike) may occur purely spontaneously or due to electric stimulation as a result of the sensor threshold being breached. Such events are deemed ‘meaningful’ only in the cases when the delay between stimulation and response is less than 100 ms. In other words, such an event is a strong indicator that the electric stimulation on one electrode caused a neural response on the recording electrode. The red vertical lines indicate the time that a rotation command is sent to the robot. These events are always coupled (the first one starts the right-turn rotation and the second simply ends the rotation). Only the second signals of each pair can be clearly seen here, as the rotation initiation commands are overlaid by the yellow electrode firing bars (a result of electrode firing which instantly initiates a rotation command).
A ‘meaningful’ event chain would be, for example, at 1.95 s, where the sonar value drops below the threshold value (30 cm) and a stimulation-response subsequently occurs.
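Putting the wall-detection, stimulation, response and turn chain together, a hedged sketch of one loop iteration might look as follows. The 30 cm threshold and the 100 ms window for a 'meaningful' response come from the text; the robot and MEA interface objects are hypothetical stand-ins.

```python
import time

SONAR_THRESHOLD_CM = 30.0     # wall-detection threshold (from the text)
MEANINGFUL_WINDOW_S = 0.1     # response within 100 ms counts as stimulus-driven

def closed_loop_step(robot, mea):
    """One iteration of the wall-avoidance loop.

    `robot` and `mea` are hypothetical interface objects providing
    front_sonar_cm(), turn_right(), stimulate() and spike_on_output_electrode().
    """
    stim_time = None
    if robot.front_sonar_cm() < SONAR_THRESHOLD_CM:
        mea.stimulate()                      # biphasic pulse on the input electrode
        stim_time = time.monotonic()

    if mea.spike_on_output_electrode():
        robot.turn_right()                   # any output spike triggers a turn
        if stim_time is not None and time.monotonic() - stim_time < MEANINGFUL_WINDOW_S:
            return "meaningful turn"         # wall -> stimulation -> response
        return "spontaneous turn"            # the culture fired without a stimulus
    return "forward"

class _Stub:                                 # minimal fake interfaces for a dry run
    def front_sonar_cm(self): return 25.0
    def turn_right(self): pass
    def stimulate(self): pass
    def spike_on_output_electrode(self): return True

print(closed_loop_step(_Stub(), _Stub()))    # -> "meaningful turn"
```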
Fig. 4. Analysis of the robot's activity during a simple wall-detection/right-turn experiment
Table 1 contains typical results from a live culture test in comparison with a “perfect” simulation. If the live culture acted “perfectly,” making no mistakes, then the two columns would be identical. Of course, this raises the question as to what a “perfect” response actually is. In this case, it could be regarded as a programmed exercise – which some might refer to as “machine-like.” In a sense, therefore, the culture is asserting its own individuality by not being “perfect.” To explain Table 1 further, ‘total closed loop time’ refers to the time between wall detection and a response signal witnessed from the culture. ‘Meaningful turns’ refers to the robot turning due to a ‘wall detection-stimulation-response’ chain of events. A ‘wall to stimulation’ event corresponds to the 30 cm threshold being breached on the sensor such that a stimulating pulse is transmitted to the culture. Meanwhile, a ‘stimulation to response’ event corresponds to a motor command signal, originating in the culture and being transmitted to the wheels of the robot to cause it to change direction. It follows that, for the culture, some of the ‘stimulation to response’ events will be in ‘considered’ response to a recent stimulus – termed meaningful. In contrast, other such events – termed spontaneous – will be either spurious or in ‘considered’ response to some thought in the culture about which we are unaware. Table 1. Basic statistics from a wall avoidance experiment
Results                          Simulation    Live Culture
Wall -> Stimulation event        100%          100%
Stimulation -> Response event    100%          67%
Total closed loop time           0.075 s       0.2-0.5 s
Run time                         240 s         140 s
Meaningful turns                 41            22
Spontaneous turns                41            16
By totalling the results of a series of such trials (over 100 were carried out), considerable differences (as typically indicated in Table 1) are observed in the ratio of meaningful to spontaneous turns between the simulation and the live culture. Under the control of the simulation, 95 ± 4% (Mean ± SD) meaningful turns were observed, whilst the remaining spontaneous turns (5 ± 4%) were easily attributable to aspects of thresholding spike activity. In contrast, the live culture displayed a relatively low number of meaningful turns (46 ± 15%) and a large number of spontaneous turns (54 ± 19%) as a result of intrinsic neuronal activity. Such a large number of spontaneous turns was perhaps only to be expected in an uncharacterised system; current work aims both to quiet the level of ongoing spontaneous, epileptiform-like activity present in such cultures and to discover more appropriate input sites and stimulation patterns.

As a follow-up closed-loop experiment, the robot's individual (right and left separately) wheel speeds were controlled by using the spike-firing frequency recorded from the two chosen motor/output electrodes. The frequency is calculated by means of the following simple principle: a running mean of spike rate from both the output electrodes was computed from the spike detector. The detected spikes for each electrode were separated and divided by the signal acquisition time to give a frequency value. These frequencies were linearly mapped (from their typical range of 0-100 Hz) to a range of 0-0.2 m/s for the individual wheel linear velocities. Meanwhile, collected sonar information was used to directly control (proportionally) the stimulating frequency of the two sensory/input electrodes. The typical sonar range of 0-100 cm was linearly re-scaled into the range 0.2-0.4 Hz for electrode stimulation frequencies (600 mV voltage pulses). The overall setup can be likened to a simple Braitenberg model [25]. However, in our case, sensor-to-speed control is mediated by the cultured network acting as the sole decision-making entity within the overall feedback loop.

One important aspect being focused on is the evocation of Long Term Potentiation (LTP), i.e. directed neural pathway changes in the culture, thereby effecting plasticity between the stimulating-recording electrodes. Although this was not a major initial target in carrying out this part of the experiment, it has been noted elsewhere that a high-frequency burst can induce plasticity very quickly [27], [28]. As a result, we are now investigating spike-timing-dependent plasticity based on the coincidence of spike and stimulus.
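The two linear mappings described above can be written down directly. The ranges (a 0-100 Hz spike rate to 0-0.2 m/s wheel speed, and a 0-100 cm sonar range to 0.2-0.4 Hz stimulation frequency) are taken from the text; the clamping at the ends of the ranges and the orientation of the sonar mapping (whether nearer obstacles give the higher or the lower stimulation frequency) are assumptions.

```python
def spike_rate_to_wheel_speed(rate_hz):
    """Map one wheel's output-electrode firing rate (0-100 Hz) to speed (0-0.2 m/s)."""
    rate_hz = min(max(rate_hz, 0.0), 100.0)          # clamp (assumed behaviour)
    return 0.2 * rate_hz / 100.0

def sonar_to_stim_frequency(distance_cm):
    """Map a sonar reading (0-100 cm) to a stimulation frequency (0.2-0.4 Hz).

    Assumes a direct proportion; the text does not state the orientation.
    """
    distance_cm = min(max(distance_cm, 0.0), 100.0)
    return 0.2 + 0.2 * distance_cm / 100.0

# Example: a 50 Hz spike rate and an obstacle at 30 cm.
print(spike_rate_to_wheel_speed(50.0))    # 0.1 m/s
print(sonar_to_stim_frequency(30.0))      # 0.26 Hz
```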
5 Learning

Inherent operating characteristics of the cultured neural network have been taken as a starting point to enable the physical robot body to respond in an appropriate fashion – to get it started. The culture then operates over a period of time within the robot body in its corral area. Experimental duration, e.g. how long the culture is operational within its robot body, is merely down to experimental design. Several experiments can therefore be completed within a day, whether on the same or differing cultures. The physical robot body can, of course, operate 24/7.

In our studies thus far, learning and memory investigations are at an early stage. However, we were able to observe that the robot appeared to improve its performance over time in terms of its wall avoidance ability. We are currently investigating this
and examining whether it can be repeated robustly and subsequently quantified. What we have witnessed could mean that neuronal structures/pathways that bring about a satisfactory action tend to strengthen purely through a process being habitually performed – learning due to habit. Such plasticity has been reported on elsewhere, e.g. [29], and experimentation has been carried out to investigate the effects of sensory deprivation on subsequent culture development. In our case we are monitoring changes and attempting to provide a quantitative characterisation relating plasticity to experience and time. The potential number of confounding variables, however, is considerable, as the subsequent plasticity process, which occurs over quite a period of time, is (most likely) dependent on such factors as initial seeding and growth near electrodes as well as environmental transients such as feed rate, temperature and humidity.

On completion of these first phases of the infrastructure setup, a significant research contribution, it is felt, lies in the application of Machine Learning (ML) techniques to the hybrid system's closed-loop experiments. These techniques may be applied in the spike-sorting process (dimensionality reduction of spike data profiles, clustering of neuronal units); the mapping process between sensory data and culture stimulation, as well as the mapping between the culture activity and motor commands; and the application of learning techniques on the controlled electrical stimulation of the culture, in an attempt to exploit the cultured networks' computational capacity.
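One of the ML steps listed above is dimensionality reduction of spike waveform profiles prior to clustering. As a hedged illustration rather than the project's actual pipeline, a PCA projection of spike waveforms can be computed with a plain SVD:

```python
import numpy as np

def pca_project(waveforms, n_components=2):
    """Project spike waveforms (n_spikes x n_samples) onto principal components."""
    centered = waveforms - waveforms.mean(axis=0)
    # SVD of the centred data gives the principal axes directly.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Synthetic example: two clusters of 1.2 ms waveforms (30 samples at 25 kHz).
rng = np.random.default_rng(1)
template_a = -np.exp(-0.5 * ((np.arange(30) - 10) / 2.0) ** 2)
template_b = -0.5 * np.exp(-0.5 * ((np.arange(30) - 15) / 4.0) ** 2)
waves = np.vstack([template_a + 0.05 * rng.normal(size=30) for _ in range(50)] +
                  [template_b + 0.05 * rng.normal(size=30) for _ in range(50)])
features = pca_project(waves)
print(features.shape)      # (100, 2): low-dimensional features ready for clustering
```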
6 Conclusions

We have successfully realised a closed-loop adaptive feedback system involving a (physical) mobile robotic platform and a cultured neuronal network using a Multi-Electrode Array (MEA), which necessitates real-time bidirectional communication between the culture and the robot. A culture being employed consists of approximately 100,000 neurons, although at any one time only a small proportion of these neurons are actively firing. Trial runs have been carried out with the overall robot and comparisons have been made with an "ideal" simulation which responds to stimuli perfectly as required. It has been observed that the culture on many occasions responds as expected; however, on other occasions it does not, and in some cases it provides a motor signal when it is not expected to do so.

The concept of an 'ideal' response is difficult to address here because a biological network is involved, and it should not be seen in negative terms when the culture does not achieve such an ideal. We know very little about the fundamental neuronal processes that give rise to meaningful behaviours, particularly where learning is involved; we therefore need to retain an open mind as to a culture's performance.

The culture preparation techniques employed are constantly being refined and have led to stable cultures that exhibit both spontaneous and induced spiking/bursting activity which develops in line with the findings of other groups, e.g. [15] and [21]. A stable robotic infrastructure has been set up, tested, and is in place for future culture behaviour and learning experiments. This infrastructure could be easily modified in order to investigate culture-mediated control of a wide array of alternative robotic
devices, such as a robot head, an 'autonomous' vehicle, robotic arms/grippers, mobile robot swarms and multi-legged walkers.

In terms of robotics, this study and others like it show that a robot can have a biological brain to make its 'decisions'. The 100,000-neuron size is due to present-day limitations – clearly this will increase. Indeed, it is already the case that 3-dimensional structures are being investigated [19]. Simply increasing the complexity from 2 dimensions to 3 dimensions (on the same basis) realises a figure of 30 million neurons (approx.) for the 3-dimensional case. The whole area of research is therefore a rapidly expanding one as the range of sensory inputs is expanded and the number of cultured neurons encapsulated rises. The potential capabilities of such robots, including the range of tasks they can perform, therefore need to be investigated.

Understanding neural activity becomes a much more difficult problem as the culture size is increased. Even the present 100,000-neuron cultures are far too complex at the moment for us to gain an overall insight. When they are grown to sizes such as 30 million neurons and beyond, clearly the problem is significantly magnified, particularly with regard to neural activity in the centre of a culture volume, which will be (effectively) hidden from view. On top of this, the nature of the neurons may be diversified. At present, rat neurons are employed in our studies. Potentially, however, any animal neurons could be used; even human neurons are not out of the question from a technical viewpoint. The authors wish to stress that ethical concerns must remain paramount in such circumstances.
7 Future Research

There are a number of ways in which the current research programme is being taken forward. Firstly, the Miabot is being extended to include additional sensory devices such as extra sonar arrays, audio input, mobile cameras and other range-finding hardware, such as an on-board infrared sensor. This will provide an opportunity to investigate sensory fusion in the culture and perform more complex behavioural experiments, possibly even attempting to demonstrate links between behaviour and culture plasticity, along the lines of [29], as different sensory inputs are integrated.

Provision of a powered floor for the robot's corral will provide the robot with relative autonomy for a longer period of time while different learning techniques are applied and behavioural responses monitored. For this, the Miabot must be adapted to operate on an in-house powered floor, providing the robot with an unlimited power supply. This feature, which is based on an original design for displays in museums [30], is necessary since learning and culture behaviour tests will be carried out for hours at a time.

The current hard-coded mapping between the robot goals and the culture input/output relationships can be extended by using learning techniques to eliminate the need for an a priori choice of the mapping. In particular, Reinforcement Learning techniques can be applied to various mobile robot tasks, such as wall following and maze navigation, in an attempt to provide a formal framework within which the learning capabilities of the neuronal culture will be studied.

To increase the effectiveness of culture training beyond the ~30% success rate seen in previous work, biological experiments are currently being performed to identify
physiological features which may play a role in cellular correlates of learning processes. These experiments also investigate possible methods of inducing an appropriate receptive state in the culture that may allow greater control over its processing abilities and the formation of memories [26], involving specific network activity changes which may allow identification of the function of given network ensembles. In particular, in terms of cholinergic influences, the possible effect of acetylcholine (ACh) [33] in coordinating the contributions of different memory systems is being investigated. A further area of research is to identify the most suitable stage of development at which to place cultures within the closed loop, and whether a less pathological (epileptiform), and therefore more effectively manipulated, state of activity is achieved when cultures are allowed to undergo initial development in the presence of sensory input.

The learning techniques employed and the results obtained from the culture need to be benchmarked. In order to achieve this, we are developing a model of the cultured neural network based on experimental data about culture density and activity. In doing so, we hope to gain a better understanding of the contribution of culture plasticity and learning capacity to the observed control proficiency. Presently, we are investigating Hidden Markov Models (HMMs) as a technique for uncovering dynamic spatiotemporal patterns emerging from spontaneously active or stimulated neuronal cultures. The use of Hidden Markov Models enables characterisation of multi-channel spike trains as a progression of patterns of underlying discrete states of neuronal activity.

Acknowledgements. This work is funded by the UK Engineering and Physical Sciences Research Council (EPSRC) under grant No. EP/D080134/1, with further financial support being provided by the Royal Society. The team wishes to thank the Science Museum (London), and in particular Louis Buckley, for hosting a display explicitly devoted to this work from October 2008 onwards. We also wish to thank New Scientist for their general popular coverage of our robot system in operation [31]. Finally, we wish to extend our gratitude to other members of the University of Reading team, namely Mark Hammond, Simon Marshall, Dimi Xydas, Julia Downes and Matthew Spencer.
References

1. Reger, B., Fleming, K., Sanguineti, V., Alford, S., Mussa-Ivaldi, F.: Connecting brains to robots: An artificial body for studying the computational properties of neural tissues. Artificial Life 6, 307–324 (2000)
2. Holzer, R., Shimoyama, I., Miura, H.: Locomotion control of a bio-robotic system via electric stimulation. In: Proceedings of International Conference on Intelligent Robots and Systems, Grenoble, France (1997)
3. Talwar, S., Xu, S., Hawley, E., Weiss, S., Moxon, K., Chapin, J.: Rat navigation guided by remote control. Nature 417, 37–38 (2002)
4. Chapin, J., Moxon, K., Markowitz, R., Nicolelis, M.: Real-time control of a robot arm using simultaneously recorded neurons in the motor cortex. Nature Neuroscience 2, 664–670 (1999)
5. Bakkum, D.J., Shkolnik, A., Ben-Ary, G., DeMarse, T., Potter, S.: Removing Some ‘A’ from AI: Embodied Cultured Networks. Lecture Notes in Computer Science, pp. 130–145 (2004)
6. Thomas, C., Springer, P., Loeb, G., Berwald-Netter, Y., Okun, L.: A miniature microelectrode array to monitor the bioelectric activity of cultured cells. Exp. Cell Res. 74, 61–66 (1972)
7. Gross, G.: Simultaneous single unit recording in vitro with a photoetched laser deinsulated gold multimicroelectrode surface. IEEE Transactions on Biomedical Engineering 26, 273–279 (1979)
8. Pine, J.: Recording action potentials from cultured neurons with extracellular microcircuit electrodes. Journal of Neuroscience Methods 2, 19–31 (1980)
9. Potter, S., Lukina, N., Longmuir, K., Wu, Y.: Multi-site two-photon imaging of neurons on multi-electrode arrays. In: SPIE Proceedings, vol. 4262, pp. 104–110 (2001)
10. Gross, G., Rhoades, B., Kowalski, J.: Dynamics of burst patterns generated by monolayer networks in culture. In: Neurobionics: An Interdisciplinary Approach to Substitute Impaired Functions of the Human Nervous System, pp. 89–121 (1993)
11. Kamioka, H., Maeda, E., Jimbo, Y., Robinson, H., Kawana, A.: Spontaneous periodic synchronized bursting during the formation of mature patterns of connections in cortical neurons. Neuroscience Letters 206, 109–112 (1996)
12. Lewicki, M.: A review of methods for spike sorting: the detection and classification of neural action potentials. Network (Bristol) 9(4), R53 (1998)
13. Saito, S., Kobayashi, S., Ohashi, Y., Igarashi, M., Komiya, Y., Ando, S.: Decreased synaptic density in aged brains and its prevention by rearing under enriched environment as revealed by synaptophysin contents. Journal of Neuroscience Research 39, 57–62 (1994)
14. Ramakers, G.J., Corner, M.A., Habets, A.M.: Development in the absence of spontaneous bioelectric activity results in increased stereotyped burst firing in cultures of dissociated cerebral cortex. Exp. Brain Res. 79, 157–166 (1990)
15. Chiappalone, M., Vato, A., Berdondini, L., Koudelka-Hep, M., Martinoia, S.: Network Dynamics and Synchronous Activity in Cultured Cortical Neurons. International Journal of Neural Systems 17(2), 87–103 (2007)
16. Shkolnik, A.C.: Neurally controlled simulated robot: applying cultured neurons to handle an approach/avoidance task in real time, and a framework for studying learning in vitro. Masters Thesis, Dept. of Mathematics and Computer Science, Emory University, Georgia (2003)
17. DeMarse, T.B., Dockendorf, K.P.: Adaptive flight control with living neuronal networks on microelectrode arrays. In: Proceedings of IEEE International Joint Conference on Neural Networks, IJCNN 2005, pp. 1549–1551 (2005)
18. Shahaf, G., Marom, S.: Learning in networks of cortical neurons. Journal of Neuroscience 21(22), 8782–8788 (2001)
19. Bull, L., Uroukov, I.: Initial results from the use of learning classifier systems to control in vitro neuronal networks. In: Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation (GECCO), pp. 369–376. ACM, London (2007)
20. Hammond, M., Marshall, S., Downes, J., Xydas, D., Nasuto, S., Becerra, V., Warwick, K., Whalley, B.J.: Robust Methodology for the Study of Cultured Neuronal Networks on MEAs. In: Proceedings 6th International Meeting on Substrate-Integrated Micro Electrode Arrays, pp. 293–294 (2008)
21. Potter, S.M., DeMarse, T.B.: A new approach to neural cell culture for long-term studies. Journal of Neuroscience Methods 110, 17–24 (2001)
22. Rolston, J.D., Wagenaar, D.A., Potter, S.M.: Precisely Timed Spatiotemporal Patterns of Neural Activity in Dissociated Cortical Cultures. Neuroscience 148, 294–303 (2007)
23. Wagenaar, D., DeMarse, T.B., Potter, S.M.: MEABench: A Toolset for Multi-electrode Data Acquisition and On-line Analysis. In: Proc. 2nd Int. IEEE EMBS Conf. Neural Eng., pp. 518–521 (2005)
24. Xydas, D., Warwick, K., Whalley, B., Nasuto, S., Becerra, V., Hammond, M., Downes, J.: Architecture for Living Neuronal Cell Control of a Mobile Robot. In: Proc. European Robotics Symposium EURO 2008, Prague, pp. 23–31 (2008)
25. Hutt, B., Warwick, K., Goodhew, I.: Emergent Behaviour in Autonomous Robots. In: Bryant, J., Atherton, M., Collins, M. (eds.) Information Transfer in Biological Systems. Design in Nature Series, vol. 2, ch. 14. WIT Press, Southampton (2005)
26. Hasselmo, M.E.: Acetylcholine and learning in a cortical associative memory. Neural Computation 5, 32–44 (1993)
27. Cozzi, L., Chiappalone, M., Ide, A., Novellino, A., Martinoia, S., Sanguineti, V.: Coding and Decoding of Information in a Bi-directional Neural Interface. Neurocomputing 65/66, 783–792 (2005)
28. Novellino, A., Cozzi, L., Chiappalone, M., Sanguineti, V., Martinoia, S.: Connecting Neurons to a Mobile Robot: An In Vitro Bi-directional Neural Interface. In: Computational Intelligence and Neuroscience (2007)
29. Karniel, A., Kositsky, M., Fleming, K., Chiappalone, M., Sanguineti, V., Alford, T., Mussa-Ivaldi, A.: Computational Analysis In Vitro: Dynamics and Plasticity of a Neuro-Robotic System. Journal of Neural Engineering 2, S250–S265 (2005)
30. Hutt, B., Warwick, K.: Museum Robots: Multi-Robot Systems for Public Exhibition. In: Proc. 35th International Symposium on Robotics, Paris, p. 52 (2004)
31. Marks, P.: Rat-Brained Robots Take Their First Steps. New Scientist 199(2669), 22–23 (2008)
32. DeMarse, T., Wagenaar, D., Blau, A., Potter, S.: The Neurally Controlled Animat: Biological Brains Acting with Simulated Bodies. Autonomous Robots 11, 305–310 (2001)
33. Chang, Q., Gold, P.: Switching Memory Systems during Learning: Changes in Patterns of Brain Acetylcholine Release in the Hippocampus and Striatum in Rats. Journal of Neuroscience 23, 3001–3005 (2003)
Sound Recognition
Yang Cai and Károly D. Pados
Carnegie Mellon University
[email protected]
Abstract. Sound recognition has been a primitive survival instinct of early mammals for over 120 million years. In the modern era, it is the most affordable sensory channel for us. Here we explore an auditory vigilance algorithm for detecting background sounds such as explosion, gunshot, screaming, and human voice. We introduce a general algorithm for sound feature extraction, classification and feedback. We use a Hamming window for tapering sound signals, and the short-time Fourier transform (STFT) and principal component analysis (PCA) for feature extraction. We then apply a Gaussian mixture model (GMM) for classification, and we use the feedback from the confusion matrix of the training classifier to redefine the sound classes for better representation, accuracy and compression. We found that frequency coefficients on a logarithmic scale yield better results than linear representations in background sound recognition. However, the magnitude of the sound samples on a logarithmic scale yields worse results than a linear representation. We also compare our results to those of the linear frequency model and the Mel-scale frequency cepstral coefficient (MFCC)-based algorithms. We conclude that our algorithm reaches a higher accuracy with the available training data. We foresee broader applications of the sound recognition method, including video triage, healthcare, robotics and security.

Keywords: audio, sound recognition, event detection, sound classification, video analytics, MFCC, sound spectrogram.
1 Introduction

The middle ear is perhaps one of the most sensitive organs in all 5,400 known mammal species. Separated from the jawbone, the middle ear enables mammals to sense their surroundings while chewing food. Sensitive hearing made it possible for early mammals to coexist with the dinosaurs; it was literally a matter of life and death. To hunt small insects, the mammalian middle ear is sensitive to high-pitched noises like a mosquito's whine [33]. To avoid day-hunting dinosaurs, the mammalian ear is tuned for detecting very quiet sounds at night. Even today, most mammals prefer to come out after dark. A recent paleontological discovery suggests that, for over 120 million years, a well-adapted, elaborate auditory system has been fundamental to the survival of mammals [34-35]. Sound recognition is a primitive instinct for mammals. In the modern era, it is the most affordable sensory channel for us, ranging from watermelon selection and car
diagnosis to using a medical stethoscope. Taking asthma diagnosis as an example, the sound generated by asthma patients' breathing is widely accepted as an indicator of disease activity [20-21]. Digital devices may revolutionize patient care by monitoring airway diseases in real-time, including recording, transmission and recognition of tracheal breath sounds [22-23].

Robotics researchers have tried to simulate natural auditory vigilance in robots. For example, the robotic head can turn to the auditory source. In some video surveillance systems, the cameras can also pan, tilt and zoom to the auditory source at night. We call this listening-only mode 'passive sensing'. Wu, Siegel, et al. [19] developed a system that can recognize vehicles based on their sound signatures. They recorded sounds of various vehicles, built feature vectors based on spectrograms and Principal Component Analysis, and classified the vectors by defining a Euclidean distance measure to the center of each known class. On the other hand, to make the auditory vigilance more effective, we can emit a sound to generate echo patterns. We call this "active sensing"; for example, an ultrasound sensor array is used on autonomously driving cars to detect nearby obstacles in real-time. Many animals use active sensing based on sound echoes, so-called echolocation. Bats, dolphins and killer whales all use echolocation. It allows these animals to accurately determine not only the direction of the sound, but also the distance of the source by measuring the elapsed time between the signal's transmission and the echo's reception. This inspires a renaissance of artificial ethology: What if we were capable of sensing like a bat?

Online music retrieval is another motivation for sound recognition. Query by tapping (QBT) is a novel method based on the rhythm of songs [24]. The system captures the essential elements of rhythm, allowing it to be represented in textual form, which has well-established algorithms to tolerate tempo variations and errors in the input. The beauty of this invention is that it doesn't require any special hardware or software - the input device is simply the space bar on the computer keyboard. Audio genre classification is also explored in [13], where the feature vector is built using statistical features to represent the "musical surface" of the audio; they also incorporate a discrete wavelet transform. They use the final 17-dimensional feature vector to classify the audio genre.

Perhaps the more challenging task here is to understand, annotate, or search the ever-growing number of digital videos based on content. In many cases, we only need to know a rough class of the clips, called "video triage", before going down to visual and textual details, e.g., explosion, gunshot and screaming, which we annotate as dangerous, alarming or scary scenes. In addition, we only need to know the gender and a rough age range to classify persons in the video, so-called "soft biometrics". In this Chapter, we would like to focus on how to extract auditory features and classify them for video triage.
2 Our Algorithm

Our algorithm contains three processes: feature extraction, classification and feedback. The feature extraction process includes sampling raw audio signals, transforming the auditory data into frequency-domain feature vectors, and compressing them into lower feature dimensions. The classification process includes a machine
learning model. Just like humans, computers must be taught what sounds they should recognize. To be able to classify new sounds automatically, we have to carefully select and prepare a training dataset and train the model with feature vectors that were extracted in the previous step. Finally, we must adjust the definition of classes according to the feedback from the training results. Figure 1 shows an overview diagram of our approach. In the following sections, we present the solution we have chosen for each of the above steps. In the end, we present our results.

Fig. 1. Overview of our algorithm: the training and testing input signal passes through normalization and windowing, feature construction, feature compression, and statistical classification to produce the output, with a feedback path for class regrouping
3 Feature Extraction

There are many methods for auditory feature extraction, for example frequency analysis, wavelet decomposition, the popular Mel-frequency cepstral coefficients (MFCC), and more complex feature vectors with additional statistical (such as spectral centroid and variance) and psychoacoustic (sharpness, pitch, loudness, etc.) components [5, 13]. Here we only use analytical features derived from the spectrum of audio frequency and strength. In this section, we focus on signal sampling, transformation and dimensionality reduction.
inside the cochlea has different stiffnesses along its length and contains thousands of hair cells on its surface. The varying stiffness of the basilar membrane causes different parts of it to vibrate at different sound frequencies, bringing the hair cells at those places into vibration. For each vibration cycle, each hair cell emits a small pulse to the brain. The sum of all pulses, which are directly related to the frequencies contained in the waveform, is interpreted and perceived as a 'sound' by the brain. To summarize, one of the functions of the inner ear is to act as an organic frequency analyzer, directing different sound frequencies to specific receptors. In the end, human sound receptors react to frequencies rather than directly to the amplitude of the sound waves [29,30]. Also, according to the Weber-Fechner law, the relationship between a stimulus's physical magnitude and its perceived intensity is logarithmic, which can be described by the equation
p = k · ln(S / S0)    (1)
Here, S is the physical stimulus, p is its perceived intensity, k is a context-dependent multiplicative factor and S0 is the threshold below which nothing is perceived. This relationship is valid for many of our sensory systems, such as the feeling of weight, vision, and, of course, hearing. We incorporate the above facts into our framework by analyzing the frequencies in the audio signal and scaling their magnitude logarithmically. Given the Weber-Fechner law, taking the logarithm of the Fourier transform of a signal simulates human hearing relatively closely. The goal of simulating human hearing is to make sure that a machine extracts the same kind of information from a sound wave that the human organ would, helping to build the same classes of sound that the human brain would interpret.

3.2 Signal Sampling
To represent an auditory signal, the first thought would be to use the Fourier transform. However, it assumes the signal has infinite length, which is not true in our case. The transform treats finite signals by infinitely repeating the same signal, which most often creates discontinuities in the signal as interpreted by the transform. In the resulting frequency analysis, this leads to unwanted effects, like the Gibbs phenomenon, which manifests as overshoot and ripples that surround the discontinuity during reconstruction. Ultimately, the problem is trying to represent a non-bandlimited signal using a finite number of infinite-length basis functions. Therefore, tapering or windowing is necessary to sample the signal. A smooth windowing function will make the signal's ends connect smoothly by band-limiting the signal, minimizing the unwanted artifacts of the transform [29]. Here we use the Hamming window to sample the auditory signal slices. To somewhat compensate for the information loss caused by the Hamming window, there is an overlap of 128 samples between each time slice. All of those slices together result in approximately 0.5 sec of audio. Our empirical tests show that, as reference [6] has stated, taking approximately half a second of audio in total results in the best performance and that the success rate saturates and then declines with longer periods of data. Since all our data has been resampled to 44100 Hz, 1024 samples for a single slice represent approximately 23 ms of audio, which is in accordance with the works of others [1, 4, 5]. We construct the Hamming window as defined by [17] in eq. (2). The next step is to transform each of these slices of audio.

Fig. 2. Hamming window minimizes unwanted artifacts: (a) untapered signal; (b) Hamming window; (c) tapered signal. Note that the ends connect more smoothly than in (a).
w(n) = 0.54 + 0.46 · cos(2πn / (N − 1))    (2)
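To make the slicing and tapering step concrete, the following sketch frames a resampled signal into 1024-sample slices with a 128-sample overlap and applies a Hamming window to each slice. It is our own illustration in Python/NumPy, not the authors' implementation; only the parameter values are the ones quoted above.

```python
import numpy as np

def frame_and_window(x, frame_len=1024, overlap=128):
    """Cut a mono signal into overlapping slices and taper each with a Hamming window."""
    hop = frame_len - overlap                      # hop between successive slices (1024 - 128 = 896 samples)
    n_frames = 1 + (len(x) - frame_len) // hop     # assumes len(x) >= frame_len
    win = np.hamming(frame_len)                    # NumPy uses the standard 0.54/0.46 coefficients
    return np.stack([x[i * hop : i * hop + frame_len] * win for i in range(n_frames)])
```

With 44100 Hz input, 24 such slices span roughly half a second, matching the analysis window described above.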
3.3 STFT Spectrogram
The spectrogram describes a piece of audio using its frequency spectrum over time. It is basically the frequency analysis of the audio, taken many times on different but adjacent time segments [18]. Because the Fourier transform is taken on short periods of data, a smooth windowing function is necessary; we have already applied one in the previous step using the Hamming window. Taking the Fourier transform of short periods of windowed data is often referred to as the short-time Fourier transform (STFT). Thus, the discrete STFT is defined as F(u) in eq. (3), where w(n) is the windowing function as described previously [4].

F(u) = Σ_{n=0}^{N−1} w(n) f(n) e^{−2πjun/N}    (3)
A spectrogram is a collection of STFTs of adjacent audio data. Spectrograms are often visualized using a 2D plot, with the horizontal axis as time and the vertical axis as frequency. The magnitude of a frequency at a specific time is represented by the intensity or color of that point. The following illustrations show the spectrogram of the first three seconds of a representative of each class in our training database. The axes are linearly scaled from 0 to 3 seconds and from 0 to 22050 Hz, while the magnitudes are normalized and scaled logarithmically. Brighter colors signify frequency components of higher magnitude. The preliminary feature vector is then the concatenation of the logarithmically scaled Fourier coefficients of each time slice in the spectrogram for 0.5 seconds.
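A minimal sketch of this construction, operating on the windowed slices produced above; the choice of 512 retained coefficients per slice and the small constant guarding the logarithm are our assumptions about implementation details the chapter does not spell out.

```python
import numpy as np

def log_spectrogram(frames, n_coeff=512, eps=1e-10):
    """Log-magnitude STFT of Hamming-windowed slices (one slice per row of `frames`)."""
    spectra = np.abs(np.fft.fft(frames, axis=1))[:, :n_coeff]   # keep 512 frequency coefficients per slice
    return np.log(spectra + eps)                                # logarithmic scaling (Weber-Fechner)

def preliminary_feature(frames):
    """Concatenate the log-scaled coefficients of all slices covering ~0.5 s of audio."""
    return log_spectrogram(frames).ravel()                      # 24 slices x 512 coefficients = 12288 values
```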
Fig. 3. Spectrogram of an explosion sound
Fig. 4. Spectrogram of multiple gunshots
Fig. 5. Spectrogram of a female speaking
Fig. 6. Spectrogram of a male speaking
Fig. 7. Spectrogram of a male screaming
3.4 Compressing the Feature Vector
After calculating the spectrogram, our feature vector has 24 slices, each with 512 elements. Thus the total length of our feature vector is 12288. Unfortunately, working with such huge features would require us to use a training database that is orders of magnitude larger. According to the Hughes phenomenon [8], the ratio between the number of training samples and the length of the feature vector must be high enough for a statistical classifier to function properly. As such, an optimal length for the final feature vector has to be chosen. If it is too low, it cannot represent the features of the classes well enough, but if it is too high, noise or other statistically unimportant features will have too much influence. We have heuristically found that a feature vector length of 24 works well for our data. We have also used exactly the same number of time slices, but these two numbers are not related. The reduction of the size of the feature vector is done in two steps. For each time slice of Fourier coefficients, we first compress each slice into 24 floating point numbers. This is done by tiling the frequency axis into 24 equidistant bins and summing up each coefficient into the corresponding bin. To further reduce the dimensionality across time segments, we implement principal component analysis (PCA). PCA transforms a data set into a coordinate system where the first component (axis) of the data set corresponds to most of the variance in the data [10]. Each succeeding component accounts for the remaining variability in a decreasing manner. PCA thus decorrelates the data and can be used to reduce the dimensionality by keeping only the first few components. This implies information loss, but in such a way that the statistically most important features are still kept [2,10]. In our implementation, we keep only the very first component, called the principal component, and transform our 24×24 vector elements into a single set of 24 elements using this component. This is our final feature vector. Assume we have data set X represented by an M × N matrix containing N sets of M observations, and we want to perform principal component analysis on X. Our data is first mean-adjusted across each dimension:

x̄_m = (1/N) Σ_{n=1}^{N} X_{m,n}    (4)
A_{m,n} = X_{m,n} − x̄_m  for all m, n    (5)
Then we construct the M × M covariance matrix C, with each element C_{i,j} containing the covariance of the data in dimensions i and j of A. Let I and J denote the vectors of data in the appropriate dimensions.

cov(I, J) = ( Σ_{k=1}^{M} (I_k − Ī)(J_k − J̄) ) / (M − 1)    (6)
From the covariances, another matrix V that contains the eigenvectors of C is computed. These eigenvectors are sorted in descending order based on the associated eigenvalues. Eigenvalues can be found by Rayleigh quotient iteration and are in turn used to obtain the eigenvectors by Gaussian elimination. The eigenvector with the highest eigenvalue is called the principal component of our data. Since we are only interested in the principal component in our implementation, we keep only this component and transpose and normalize it to unit length. Treating it as a 1-row matrix V', we finally use it to transform our data into a single M-element vector. This mathematically projects our data set into the space represented by the principal component, formulated as:

Y = V' × A    (7)

where A is the mean-adjusted data set, V' is the principal component, and × is matrix multiplication. Here is pseudocode of the algorithm for extracting features of audio:

1. Tile audio signal data into slices with overlap
2. Zero offset
3. Normalize by dividing by the greatest magnitude
4. For each slice:
   a. Apply Hamming window
   b. Compute logarithm of power spectrum
   c. Compress using frequency binning
5. Concatenate processed slices
6. Compress using principal component
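A compact sketch of steps 5-6 above — the 24-bin frequency compression followed by projection onto the first principal component. It is our own NumPy rendering of the description, not the authors' code; the orientation of the 24×24 matrix (time slices as rows, frequency bins as columns) is an assumption.

```python
import numpy as np

def compress_feature(log_spec_slices, n_bins=24):
    """Reduce (n_slices x n_freq) log-spectra to a single 24-element feature vector."""
    n_slices, n_freq = log_spec_slices.shape
    edges = np.linspace(0, n_freq, n_bins + 1, dtype=int)         # 24 equidistant frequency bins
    binned = np.stack([log_spec_slices[:, edges[b]:edges[b + 1]].sum(axis=1)
                       for b in range(n_bins)], axis=1)            # shape (n_slices, 24)
    A = binned - binned.mean(axis=0)                               # mean-adjust each dimension (eqs. 4-5)
    C = np.cov(A, rowvar=False)                                    # covariance matrix (eq. 6)
    eigvals, eigvecs = np.linalg.eigh(C)
    v1 = eigvecs[:, np.argmax(eigvals)]                            # unit-length principal component
    return A @ v1                                                  # projection Y = V' x A (eq. 7)
```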
The resulting feature vector can be used to train a classifier in the training phase. Alternatively, after training, the feature vector can be used to classify new audio samples.
4 Sound Classification

There are many kinds of classifiers for sound recognition. Support Vector Machines [25] are a common choice, as well as k-Nearest Neighbor schemes [19], neural networks [31] or Hidden Markov Models [3]. Not all of these methods return a probability value for the result of the classification. It is advantageous not only to be able to classify sound data, but to be able to detect whether a pattern we are interested in is
present or not. Gaussian mixture models are also widely used in signal processing. They provide a probability value that can serve as a confidence indicator: based on this indicator, we should be able to reject a feature vector and conclude that it does not belong to any of the predefined and trained classes if the value is too low. Here we chose a Gaussian mixture model for our implementation.

4.1 The Classification Model
Classification problems are closely related to clustering. The goal is, given a point in high-dimensional space (e.g., a feature vector), to find the class of known points it belongs to, where a class of points is represented by potentially multiple clusters. Clusters are also called components and are described by their statistical properties [12]. A single cluster is mathematically represented by a parameterized statistical distribution, most commonly a Gaussian distribution in the continuous case. The entire data in a class is thus represented as a mixture of these Gaussian distributions, hence the name Gaussian mixture model.
Fig. 8. (a) A set of points in 2D space generated by a Gaussian mixture. Each cluster can be composed of multiple components. (b) The probability density function f of a 2-component Gaussian mixture.
Assume that Z is a set of feature vectors, built as described above and all belonging to the same class, and d is the dimensionality of a single vector Y (Y ∈ R^d). If there are K components, component k can be parameterized by its mean μ_k and its covariance matrix C_k [12]:

f_k(Y) = φ(Y | μ_k, C_k) = 1 / sqrt((2π)^d |C_k|) · exp( −(Y − μ_k)^T C_k^{−1} (Y − μ_k) / 2 )    (8)
The mixture density, with each component's weight a_k, is then as described in reference [11]:

f(Y) = Σ_{k=1}^{K} a_k f_k(Y)    (9)
The job of the training procedure is to find the parameters of the clusters, given the feature vectors from the training set and the classes they belong to. The goal is to find the parameters of the unknown distributions for each class so that they maximize the probability of the data of the same class [11]. Formally, for data set Z and parameter set θ, find

θ' = argmax_θ p(Z | θ) = argmax_θ Π_{p=1}^{n} p(Y_p | θ)    (10)
The expectation maximization (EM) algorithm iteratively finds an approximation of θ under the above criterion, and it can be used for other kinds of distributions too. In each iteration, the algorithm alternately performs the E-step (expectation) and the M-step (maximization). In the E-step, each data point is given some probability of belonging to each of the clusters. In the M-step, the cluster parameters are recomputed based on all the data belonging to them, weighted by the probability of each data point belonging to that specific cluster. The process then starts over. It has been shown that the EM algorithm converges [11] to a maximum. Initialization can be done randomly in simpler cases or with other algorithms such as k-means; good initialization is important for the algorithm to converge toward a global maximum. For more information on the EM algorithm, see references [7,8,11]. Reference [12] has information about the EM algorithm that is specific to Gaussian mixture models. The implementation used by our application is described in reference [16].
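As an illustration of these two alternating steps, here is a bare-bones EM loop for a full-covariance Gaussian mixture. It is our own sketch (random initialization, a fixed number of iterations, a small regularization term added to the covariances), not the CLUSTER implementation of reference [16].

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm_em(Y, K=3, n_iter=100, seed=0):
    """Fit a K-component Gaussian mixture to data Y (n_samples x d) with EM."""
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    means = Y[rng.choice(n, size=K, replace=False)]            # random initialization (k-means is more robust)
    covs = np.stack([np.cov(Y, rowvar=False) + 1e-6 * np.eye(d)] * K)
    weights = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        dens = np.column_stack([weights[k] * multivariate_normal.pdf(Y, means[k], covs[k])
                                for k in range(K)])             # shape (n, K)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and covariances from the responsibilities
        Nk = resp.sum(axis=0)
        weights = Nk / n
        means = (resp.T @ Y) / Nk[:, None]
        for k in range(K):
            diff = Y - means[k]
            covs[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return weights, means, covs
```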
4.2 Classification Procedure

Classification is achieved by evaluating the probability of a feature vector for each of the clusters with trained parameters. The feature vector is assumed to belong to the cluster which produces the highest probability. For practical reasons and because of limitations of digital computers, the natural logarithm of probabilities is used in calculations instead of the true probabilities themselves. Multiplying multiple probability values together, which are smaller than 1.0 by definition, would quickly result in underflows. The property of the logarithm that reduces multiplication to addition helps resolve this problem. Detection is achieved by letting the application classify the test samples and also output the log-likelihood of the data belonging to the determined class. If this probability value is too low, the data is rejected and will not be considered as one of the known classes. However, care has to be taken when determining the threshold for the probability value, as there is no common global value suited for every case. The optimal threshold should be determined by hand and depends heavily on the clusters, the training database and the exact algorithms used.
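The following sketch shows the classify-with-rejection logic using scikit-learn's GaussianMixture in place of the chapter's own GMM code; the number of components and the rejection threshold value are illustrative assumptions — as noted above, the real threshold has to be tuned by hand.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(features_by_class, n_components=3):
    """features_by_class: dict of class name -> (n_samples, 24) array of feature vectors."""
    return {name: GaussianMixture(n_components=n_components, covariance_type="full",
                                  random_state=0).fit(X)
            for name, X in features_by_class.items()}

def classify(models, feature, reject_threshold=-200.0):
    """Pick the class whose GMM gives the highest log-likelihood; reject if all are too low."""
    scores = {name: float(gmm.score_samples(feature.reshape(1, -1))[0])
              for name, gmm in models.items()}
    best = max(scores, key=scores.get)
    if scores[best] < reject_threshold:
        return None, scores[best]          # not one of the known classes
    return best, scores[best]
```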
5 Feedback Model

In many cases, we don't know the acoustic correlations of sound classes. The classification results from machine learning may provide valuable feedback about the acoustic correlation. When possible, we can regroup the highly correlated classes into a new class with a common name. Audio classification results are often presented in the form of a confusion matrix, a table where the header of every row is the actual class of the audio and the header of every column is the class that the audio has been detected as. Given a reasonable classification procedure, such a confusion matrix can be used as a guide to collapse two classes if they are found positioned too close to each other in the chosen feature space. This is useful for recognizing falsely assumed dissimilarity between those two classes. In a confusion matrix, the number of correctly classified samples accumulates in the matrix diagonal; falsely classified ones will be found outside of the diagonal. If two classes indexed i and j are being processed separately although they exhibit very similar features, the majority of the errors will be found in j for i, and in i for j. Misclassification of either of those two against any other third class will be comparatively low. Using this method, we can specify a margin we call a collapsing threshold that, when reached, will cause the two classes to be collapsed into one. Let i and j be the indices of two arbitrary classes after a complete classification procedure. Furthermore, let R be the n × n confusion matrix for n audio classes. The sum of row l of R can be defined as

S_l = Σ_{k=1}^{n} R_{l,k}    (11)
Then we can define a Boolean expression B that, when it evaluates to true, causes classes i and j to be collapsed.
B_{i,j} = True  if  R_{i,j} / (S_i − R_{i,i}) ≥ t  ∧  R_{j,i} / (S_j − R_{j,j}) ≥ t;  otherwise False    (12)
With t being the collapsing threshold, the above expression is to be evaluated for each class pair i and j, where i ≠ j.
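A direct rendering of eqs. (11)-(12) as a helper that scans a confusion matrix for class pairs to collapse; it is our own sketch, and the threshold value mentioned afterwards is an illustration rather than a value the chapter prescribes.

```python
import numpy as np

def classes_to_collapse(R, t):
    """Return index pairs (i, j) whose mutual confusion exceeds the collapsing threshold t."""
    R = np.asarray(R, dtype=float)
    S = R.sum(axis=1)                                        # row sums S_l (eq. 11)
    pairs = []
    for i in range(len(R)):
        for j in range(i + 1, len(R)):
            err_i, err_j = S[i] - R[i, i], S[j] - R[j, j]    # misclassified samples of classes i and j
            if err_i > 0 and err_j > 0 and R[i, j] / err_i >= t and R[j, i] / err_j >= t:
                pairs.append((i, j))                          # eq. (12) evaluates to True
    return pairs
```

Applied to the counts in Table 2 with, say, t = 0.5, only the explosion/gunshot pair satisfies the criterion — exactly the pair collapsed into the "blast" class in Table 3.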
6 Experiment Design

We have set out to be able to classify five classes of audio: screams, explosions, gunshots, human female speech and human male speech. For each of these classes, we have collected at least a hundred audio files and, for each of the audio files, a single feature vector was constructed based on the beginning of the data. Data came from various sources in various formats and in general was not noise-free. This was to make sure that the classifier is trained on general data that can be used in practice. Sources include audio collection libraries, audio streams of videos from the video
sharing website YouTube, and audio streams of videos that were shot using common (unprofessional) handheld cameras. However, most of the speech samples originate from the TIMIT database [27]. The formats include compressed and uncompressed data, ranging from low-quality 8 bit, 11025 Hz sources to high-fidelity 16 bit, 48 kHz sources. To be able to decode all these formats we used the open-source FFmpeg libraries [28]. During the processing of the sound files, all were transformed and resampled into a common 64 bit floating point, 44100 Hz mono format (44.1 kHz is a common sampling rate to represent sounds perceivable by most humans) to lower the complexity of the later stages in the application. This relieved us from having to code multiple code paths for processing and interpreting multiple sample formats. The conversion was done using the FFmpeg audio conversion API. Table 1 provides an overview of the number of files we use in each class. Of all the samples collected, roughly 80% from each class was used for training, and the remaining 20% was used for obtaining the classification results. The training and testing sets do not overlap.

Table 1. Number of sound files in the database

Class       No. of files for training   No. of files for testing
Explosion   184                         45
Gunshot     110                         25
Female      199                         55
Male        203                         55
Scream      95                          23
7 Results

The classification results are presented in Table 2 as confusion matrices. At the end of each row, the success rate for the detection of that class is summarized.

Table 2. Spectrogram-based results

             Scream   Explosion   Gunshot   Male   Female   Success
Scream           22           0         1      0        0    95.65%
Explosion         2          37         6      0        0    82.22%
Gunshot           5           7        13      0        0    52.00%
Male              2           1         0     52        0    94.54%
Female            2           0         1      0       52    94.54%
Screams have an almost perfect classification rate. Differentiation between male and female speech is very good, with only six samples being misidentified out of 110. Explosions also get classified well, except in some cases where an explosion is mistaken for a gunshot. Gunshot is the class with the lowest overall success in all of our tests; the most frequent error was mixing them up with explosions, but on occasion they were mistaken for screams. However, without knowing the context it is often hard or impossible even for humans to differentiate explosions from gunshots. These two classes are very similar in their features, which is not surprising as both are the result of principally the same chemical and physical reactions. Using the feedback model from Section 5, we can revise the table such that explosions and gunshots form a new class called "blast".

Table 3. Results using collapsed classes

           Scream   Blast   Male   Female   Success
Scream         22       1      0        0    95.65%
Blast           7      63      1        0    90.00%
Male            2       1     52        0    94.54%
Female          2       1      0       52    94.54%
8 Comparisons with Other Methods

8.1 Log vs. Linear
As mentioned earlier, one step in building the feature vectors was taking the logarithm of the power frequency magnitudes. In this experiment we conclude that logarithmic scaling does indeed produce better results when compared to linear scaling, as can be seen from Table 4.

Table 4. Spectrogram-based results with linear magnitude scaling

             Scream   Explosion   Gunshot   Male   Female   Success
Scream           21           1         0      0        1    91.30%
Explosion         2          17        25      1        0    37.78%
Gunshot           1           7        13      4        0    52.00%
Male              0           1         2     48        4    87.27%
Female            1           2         1     15       36    65.45%
Logarithmically scaling the frequencies produces better results because it more closely resembles the way humans hear, in keeping with the Weber-Fechner law discussed in Section 3.1: taking the logarithm of the Fourier transform of a signal simulates human hearing relatively closely, so the machine extracts the same kind of information from a sound wave that the human ear would.
Table 5. Spectrogram-based results with logarithmic magnitude scaling

             Scream   Explosion   Gunshot   Male   Female   Success
Scream           21           1         0      0        1    91.30%
Explosion         1          16        14     14        0    35.56%
Gunshot           2           7        11      5        0    44.00%
Male              1           2         0     50        2    90.91%
Female            0           2         1     19       33    60.00%
In a separate experiment, we also tried scaling the samples logarithmically instead of the power frequency magnitudes (Table 5). These results are also inferior to those listed in Table 2.

8.2 MFCC vs. Spectrogram-Based
For speech genres, MFCC is often used for constructing a feature vector. A great amount of research has gone into determining the purposes for which MFCC is adequate and the best way to perform the transformation. The study in reference [1] uses the discrete cosine transform as an approximation of the KL-transform to decorrelate the elements of a feature vector that is also MFCC-based for music. This study shows that MFCCs are not only suited for speech but also for music modeling. Although it does not claim that MFCCs are optimal for music, it does conclude that they are at least not harmful. The study in reference [9] tries to determine music similarity and additionally compares different MFCC implementation techniques. They conclude that, "with MFCCs based on fixed order, signal independent LPC, warped LPC, MVDR, or warped MVDR, genre classification tests did not exhibit any statistically significant improvements over FFT-based methods". This leads to the conclusion that it is preferable to use the FFT for the spectral estimation in MFCC for music similarity because of its performance advantage. The study in [4] also compares MFCCs, but in the context of MP3 encoding quality. Their results show that the filter bank implementation of the MFCC is only an issue at low bitrates. Since MFCCs seem to perform well for both speech and music spectra, we extend this idea and try to use them for general sound pattern modeling too. The study in reference [5] explores the classification of not only some music genres, but also speech, noise and crowd noise, comparing multiple feature sets. In [14], different sound patterns, like explosions and cries, are categorized using spectral properties, and a correlation model is used as the classifier. The study in reference [15] describes a content-based retrieval system for various sounds. As an alternative to the method described in Section 3, we produced an implementation based on Mel-scale Frequency Cepstral Coefficients (MFCC). MFCCs have been widely adopted in the field of describing features of human speech, speech
recognition and voice identification as well as for geological purposes. It has been shown that they are even suitable for music classification [1,13]. Inspired by these results, we will see whether MFCC is adequate for the sound classes in our tests too. First, we define the cepstrum of a sound as the magnitude of the Fourier transform of the logarithmically scaled magnitude-spectrum of the original audio [1,4]:

MFCC(n) = | FFT( log | FFT( w(n) f(n) ) | ) |    (13)
The Mel-scale cepstrum is given by transforming the frequency axis after the first FFT operation into the Mel scale, which is commonly given by [3,4]:

φ = 2595 · log10( f / 700 + 1 )    (14)
For the conversion into the Mel scale, different implementations exist and the exact conversion is most often only approximated. The common method is to define a set of band-pass filters, where the distance between the centers of each filter is based on the Mel scale. As studies [9,4] show, implementations differ in many aspects, including the exact shape of the filters, the lower and upper bound frequencies, the magnitude of the filter centers and the number of bands in the filter bank. Some implementations place the bands in the lower frequency range on a linear estimation of the Mel scale. In our implementation, all bands have equal height; they range from 20 Hz to 22050 Hz and we define 24 bands. This way, there is no need for explicitly binning the frequencies, as with the spectrogram approach, because that step is automatically done when converting to the Mel scale.

Table 6. MFCC-based results

             Scream   Explosion   Gunshot   Male   Female   Success
Scream           20           1         1      0        1    86.96%
Explosion         9          30         5      1        0    66.67%
Gunshot           2          14         9      0        0    36.00%
Male              0           2         1     51        1    92.72%
Female            0           0         1      2       52    94.54%
The use of Mel-scale frequency cepstral coefficients produced marginally better results for human speech on clear samples only, and considerably worse classification rates for the other classes. When we added noisy samples to the male and female classes, the results became slightly worse, as can be seen from the tables. We conclude that MFCC is a useful tool when used appropriately, but it does not generate feature vectors suitable for all audio classes. For the classes tested in this work, our own feature vector implementation produces significantly better results.
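For reference, a small sketch of the Mel mapping in eq. (14) and a cepstrum in the spirit of eq. (13). This is our own illustration; as noted above, real MFCC implementations differ in filter shapes, band placement, and the final decorrelation step, none of which are reproduced here.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Frequency-to-Mel mapping of eq. (14)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_band_edges(n_bands=24, f_low=20.0, f_high=22050.0):
    """Band edges in Hz, spaced uniformly on the Mel scale (24 bands from 20 Hz to 22050 Hz)."""
    mel_edges = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_bands + 2)
    return 700.0 * (10.0 ** (mel_edges / 2595.0) - 1.0)        # invert eq. (14)

def cepstrum(windowed_frame, eps=1e-10):
    """Magnitude cepstrum per eq. (13): |FFT(log |FFT(w(n) f(n))|)|."""
    spectrum = np.abs(np.fft.fft(windowed_frame))
    return np.abs(np.fft.fft(np.log(spectrum + eps)))
```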
9 Background Noise Issues

In our context, we refer to unwanted sounds as noise. Noise is unwanted because it negatively affects the processing of our target auditory data. In general, automated processing equipment like computers cannot tell noise from the useful data unless we have some knowledge about the nature of the noise itself. For classification, for example, if the noise component of the data is too strong, there is a high possibility that the feature extraction process will construct features of the noise, or that the classifier will be trained on the nature of the unwanted data instead of the intended audio component.

Samples collected from audio sources are in general not noise-free. Noise comes from many different sources, and it is possible to classify noise sources based on different criteria. One possible classification scheme, and maybe the most intuitive one, is the origin of the noise, such as environmental noise (e.g., a crowd, wind), transmission channel noise (FM broadcast), data corruption (digital files), recording equipment (imperfections of the hardware) or observation-inherent noise (sampling, quantization). Not all of these are relevant in every case: with current technology, for example, quantization noise is rarely an issue. If the noise is statistically of a different nature than the audio that we originally target, it is also possible to filter out even relatively strong noise. A random background noise can be easily filtered out from periodic waveforms, for example. On the other hand, trying to filter out strong crowd noise from a single human's speech can be very challenging.

For sound classifier applications, if it can be foreseen that future data to be tested will not be noise-free in general and that the noise cannot be removed or is not practical to remove, it is important to also use noisy audio samples for the training procedure. This allows the noise to be trained into the classification framework, making it somewhat tolerant to the noise. In general, it is good practice to train the system on a relatively large number of sound samples even if it will only differentiate between a few classes. This is because of the huge variability of the audio samples, whose two largest contributors are variability in the source itself and noise. The amount of noise that can be tolerated is highly case- and implementation-dependent. It will depend on the number and kind of training samples used (and so indirectly on the class definitions also), the feature extraction implementation, and the exact classifier in the classification framework. The study in [25] does non-speech human sound classification and is tolerant to noise if the signal-to-noise ratio (SNR) reaches approximately 50 dB. On the other hand, the speech recognition framework studied in [3] tolerates noise even with an SNR as low as 10 dB in their measurements. The study in [26] designs a noise-robust FFT-based feature vector that achieves success rates better than 90% with an SNR of 10 dB.
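Since the tolerable noise level is quoted in terms of SNR, a small helper for measuring it, plus one simple way of mixing noise into training material at a chosen SNR, may be useful. Both are our own illustrations, not a procedure taken from the chapter.

```python
import numpy as np

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels, from a clean signal and a noise recording."""
    return 10.0 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

def mix_at_snr(signal, noise, target_snr_db):
    """Scale the noise and add it to the signal so that the mixture has the requested SNR."""
    noise = noise[:len(signal)]
    scale = np.sqrt(np.mean(signal ** 2) /
                    (np.mean(noise ** 2) * 10.0 ** (target_snr_db / 10.0)))
    return signal + scale * noise
```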
10 Conclusions

Sound recognition is a basic instinct for mammals. In this Chapter, we explore how to enable auditory vigilance computationally. We use a Hamming window to taper sound signals, the short-time Fourier transform (STFT) to extract the acoustic features and
use principal component analysis (PCA) to compress the feature vector. Then we apply a Gaussian mixture model (GMM) with the expectation maximization (EM) algorithm for classification. Based on 203 test samples and 791 training samples, we successfully recognize blasts (explosions or gunshots), screaming, human female voices and human male voices with accuracy over 90.00%. We believe that our results surpass previous studies in terms of accuracy and robustness on real-world data. Second, to improve the sound representation and classification, we add a feedback channel from the confusion matrix of the training classifier so that we can collapse sound classes for better accuracy and compression. For example, by collapsing the explosion and gunshot classes into one 'blast' class, we not only reduce the number of classes, but also increase the accuracy from 52% to 90.00%. This also indicates that sound classes could be hierarchical: the more accurate the sound classifier, the finer the distinctions that can be made among lower-level sounds. Our feedback model provides a measurable way to refine the class definitions in a hierarchy. We found that frequency coefficients on a logarithmic scale yield better results in background sound recognition, which is consistent with the Weber-Fechner law. However, sample magnitudes on a logarithmic scale do not yield better results than a linear representation. We also compare our results with other sound recognition methods, such as MFCC. We found that MFCC is good for representing human voices and music; however, since our task covers a broader range of the sound spectrum, our result is better than MFCC. Finally, we believe that sound recognition can go beyond auditory vigilance for anomalous sounds. It has potential in video triage, healthcare, robotics and security. Used in a passive sensing mode, sound recognition is an affordable watchdog.
Acknowledgement This research was supported by Center for Emergency Response Team at Carnegie Mellon and CyLab at Carnegie Mellon under grants DAAD19-02-1-0389 and W911NF-09-1-0273 from the Army Research Office. The authors would like to thank William Eddy and Emily Durbin for their comments and editing, Mel Siegel and Huadong Wu for their inspiring work on vehicle sound recognition and Rafael Franco for his outstanding rapid prototype that led to this project.
References 1. Logan, B., et al.: Mel Frequency Cepstral Coefficients for Music Modelling. Cambridge Research Laboratory (2000) 2. Lindsay, I., Smith, A.: Tutorial on Principal Components Analysis (2002) 3. Shannon, B.J., Paliwal, K.K.: A Comparative Study of Filter Bank Spacing for Speech Recognition. In: Microelectronic Engineering Research Conference (2003) 4. Sigurdsson, S., et al.: Mel Frequency Cepstral Coefficients: An Evaluation of Robustness of MP3 Encoded Music, Technical University of Denmark (2006)
5. Breebaart, J., McKinney, M.: Features for Audio Classification, Philips Research Laboratories (2008) 6. Spina, M.S., Zue, V.W.: Automatic transcription of general audio data: Preliminary analysis. In: Proc. 4th Int. Conf. on Spoken Language Processing, Philadelphia, PA (1997) 7. Dellaert, F.: The Expectation Maximization Algorithm, College of Computing, Georgia Institute of Technology (2002) 8. Hsieh, P.-F., Landgrebe, D.: Classification of High Dimensional Data, Purdue University School of Electrical and Computer Engineering, ECE Technical Reports (1998) 9. Jensen, J.H., et al.: Evaluation of MFCC Estimation Techniques for Music Similarity 10. Shlens, J.: A Tutorial on Principal Component Analysis: Derivation, Discussion and Singular Value Decomposition (2003) 11. Bengio, S.: An Introduction to Statistical Machine Learning - EM for GMMs -, Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) slides (2004) 12. Li, J.: Mixture Models, Department of Statistics slides, The Pennsylvania State University (2008) 13. Tzanetakis, G., Cook, P.: Musical genre classification of audio signals. IEEE Trans. Speech Audio Processing 10, 293–301 (2002) 14. Pfeiffer, S., Fischer, S., Effelsberg, W.: Automatic audio content analysis. Tech. Rep. No. 96-008, University of Mannheim (1996) 15. Foote, J.: Content-based retrieval of music and audio. Multimedia Storage and Archiving Systems II, 138-147 (1997) 16. Bouman, C.A.: CLUSTER: An Unsupervised Algorithm for Modeling Gaussian Mixtures, School of Electrical Engineering, Purdue University (2005) 17. Window function, http://en.wikipedia.org/wiki/Window_function (retrieved on 07/30/2010) 18. Spectrogram, http://en.wikipedia.org/wiki/Spectrogram (retrieved on 07/30/2010) 19. Siegel, M., et al.: Vehicle Sound Signature Recognition by Frequency Vector Principal Component Analysis. IEEE Trans. on Instrumentation And Measurement 48(5) (October 1999) 20. Spiteri, M.A., Cook, D.G., Clark, S.W.: Reliability of eliciting physical signs in examination of the chest. Lancet. 2, 873–875 (1988) 21. Pasterkamp, H., Kraman, S.S., Wodicka, G.R.: Respiratory sounds:advances beyond the stethoscope. American Journal of Respiratory Critical Care Medicine 156, 974–987 (1997) 22. Anderson, K., Qiu, Y., Whittaker, A.R., Lucas, M.: Breath sounds, asthma, and the mobile phone. Lancet. 358, 1343–1344 (2001) 23. Cai, Y., Abascal, J.: Ambient Intelligence in Everyday Life. LNCS (LNAI), vol. 3864. Springer, Heidelberg (2006) 24. Peter, G., Cukierman, D., Anthony, C., Schwartz, M.: Online music search by tapping. In: Cai, Y., Abascal, J. (eds.) Ambient Intelligence in Everyday Life. LNCS (LNAI), vol. 3864, pp. 178–197. Springer, Heidelberg (2006) 25. Liao, W.-H., Lin, Y.-K.: Classification of Non-Speech Human Sounds: Feature Selection and Snoring Sound Analysis. In: Proc. of the 2009 IEEE Int. Conf. on Systems, Man and Cybernetics (2009) 26. Chu, W., Champagne, B.: A Noise-Robust FFT-Based Spectrum for Audio Classification, Department of Electrical and Computer Engineering. McGill University, Montreal (2006) 27. TIMIT Acoustic-Phonetic Continuous Speech Corpus, Linguistic Data Consortium, University of Pennsylvania, http://www.ldc.upenn.edu/Catalog/ CatalogEntry.jsp?catalogId=LDC93S1
28. FFmpeg, http://ffmpeg.org 29. Smith, S.W.: The Scientist & Engineer’s Guide to Digital Signal Processing. California Technical Pub. (1997) ISBN 0966017633 30. Hearing Central LLC, How the Human Ear Works, http://www.hearingaidscentral.com/howtheearworks.asp (retrieved on 10/25/2010) 31. Lee, H., et al.: Unsupervised feature learning for audio classification using convolutional deep belief networks. Stanford University, Stanford (2009) 32. Forero Mendoza, L.A., Cataldo, E., Vellasco, M., Silva, M.: Classification of Voice Aging Using Parameters Extracted from the Glottal Signal. In: Diamantaras, K., Duch, W., Iliadis, L.S. (eds.) ICANN 2010. LNCS, vol. 6354, pp. 149–156. Springer, Heidelberg (2010) 33. Angier, N.: In Mammals, a Complex Journey. New York Times (October 13, 2009) 34. Ji, Q., Luo, Z.X., Zhang, X.L., Yuan, C.X., Xu, L.: Evolutionary Development of the Middle Ear in Mesozoic Therian Mammals. Science 9 326(5950), 278–281 (2009) 35. Martin, T., Ruf, I.: On the Mammalian Ear. Science 326(5950), 243–244 (2009)
Texture Vision: A View from Art Conservation
Pierre Vernhes and Paul Whitmore
Art Conservation Research Center, Department of Chemistry, Carnegie Mellon University, 700 Technology Drive, Pittsburgh, PA 15219
{pvernhes,pw1j}@andrew.cmu.edu
Abstract. The appreciation of many works of visual art derives from the observation and interpretation of the object surface. The visual perception of texture is key to interpreting those surfaces, for the texture provides cues about the nature of the material and the ways in which the artist has manipulated it to create the object. The quantification of texture can be undertaken in two ways: by recording the physical topography of the surface or by analyzing an image that accurately portrays the texture. For most art objects, this description of texture on a microscopic level is not very useful, since how those surface features are observed by viewers is not directly provided by the analysis. For this reason, image analysis seems a more promising approach, for in the images the surfaces will naturally tend to be rendered as they would when viewing the object. In this study, images of textured surfaces of prototype art objects are analyzed in order to identify the methods and the metrics that can accurately characterize slight changes in texture. Three main applications are illustrated: the effect of the conditions of illumination on perceived texture, the characterization of changes of object due to degradation, and the quantification of the efficiency of the restoration.
1 Introduction

The appreciation of many works of visual art derives from the observation and interpretation of the object's surface. The visual perception of texture is key to interpreting those surfaces, since the texture provides cues about the nature of the material and the ways in which the artist has manipulated it to create the object. But surface textures are not immutable qualities of an object. The surfaces can be portrayed differently depending on how they are illuminated during exhibition. Deterioration and damage can lead to alteration or loss of an object's surface. Cracks may emerge on painted surfaces, metals may corrode, textiles may pill, stone may become worn or granular. These surface alterations threaten the aesthetic message of the artist. For this reason, conservators are often asked to restore a surface to some earlier condition or to stabilize the current state. Even the most delicate of treatments can further alter the surface. Some of the most extreme interventions, such as consolidation (the infusion of an adhesive in order to stabilize a very friable surface), can lead to a profound change in the appearance of the surface. A continuing challenge for the art conservation field is to develop treatments that are effective with minimal or acceptable changes to surface texture. This effort is made more difficult by the lack of
an analytical method to quantitatively measure the appearance of surface texture and the changes induced by aging and treatment. Defining texture is not an easy task. As a matter of fact, there is no strict definition of texture, and each research field appropriates the word in a different way. Texture is often classified as complex visual patterns and subpatterns that have characteristic properties such as color, size, brightness, etc. [1]. Descriptions of the visual perception of texture are expressed in terms of roughness, regularity, glossy vs. matte, granulation, and so on. In addition, the textural properties of a surface are deeply related to the illumination conditions. Texture may be enhanced by grazing angle illumination, while diffuse illumination will tend to de-emphasize the surface topography. The quantification of texture can be undertaken in two ways: by recording the physical topography of the surface or by analyzing an image that accurately portrays the texture. For the former, 3D mapping of a surface using an optical profilometer or AFM is usually used when the surface quality requirements are extremely precise. For most art objects, this description of texture on a microscopic level is not very useful, since how those surface features are observed by viewers is not directly provided by the analysis. For this reason, image analysis seems a more promising approach, for in the images the surfaces will naturally tend to be rendered as they would when viewing the object. There exist numerous ways to perform image analysis on a collection of data. The four main categories of approaches to texture analysis are structural, statistical, model-based, and transform [2]. The choice of the set of analytical tools needed depends on the object and on the aim of the study. In this study, images of textured surfaces of prototype art objects were analyzed in order to identify the methods and the metrics that can accurately characterize slight changes in texture. As a case study, we investigated the effect of a dry-cleaning treatment on unpainted canvas. Three main applications are illustrated: the effect of the conditions of illumination on perceived texture, the characterization of changes of an object surface due to degradation, and the quantification of the texture changes resulting from conservation treatment.
2 Material and Methods

2.1 Experimental Setup

The experimental device presented in Figure 1 allows for control of the inclination of the illumination. A digital camera is positioned normal to the surface of the sample to be examined (the camera is fixed). The sample itself is located on a rotating stage. Since the sample stage rotates, it is possible to fully describe the response of a sample to light according to both the inclination and the azimuthal position. Five inclination angles were chosen: 20°, 30°, 45°, 60° and 75°. These angles are measured relative to the position of the camera (i.e., relative to the surface normal). A typical measurement involves 40 pictures, corresponding to 5 inclinations and 8 azimuthal positions of the lamps. The images were captured using the experimental setup presented in Figure 1, with image sizes of 2896 x 1944 pixels. The images were converted to gray scale in order to extract textural properties. The different algorithms and data treatments were coded as routines written in MATLAB V.8.
Fig. 1. Experimental setup and the corresponding angular position of the lamps
2.2 Image Analysis Methods

Non-destructive texture and surface quality analyses have applications for a large variety of materials and fields. That is why the number of analytical tools aiming to describe surface texture is huge. Among the most popular are histogram analysis, autocorrelation, discrete image transforms, ring/wedge filtering, Gabor filtering, and gray level co-occurrence matrices. Sonka et al. [3] classify the different approaches by defining two families: statistical analysis and syntactic analysis. In a more elaborate categorization, Turceryan and Jain [1] distinguish four different approaches to texture analysis: statistical, geometrical, model-based, and signal processing approaches.

2.2.1 First-Order Histogram-Based Features

The most natural way to analyze an image is to calculate its first-order histogram, from which parameters such as the mean, the variance, the standard deviation, the skewness or the kurtosis can be extracted. Despite their simplicity, histogram techniques demonstrate their usefulness in various applications. They are insensitive to rotation and translation. Although the first-order histogram does not carry any information on how the gray levels are spatially related to each other within the image, recent studies show that this simple method is capable of distinguishing matte from glossy textures. Motoyoshi et al. [4] demonstrated the close relationship between the asymmetry of the luminance distribution (the skew) and the perceived gloss or lightness of an image.

2.2.2 Gray Level Co-occurrence Matrix

Another approach analyzes the second-order histogram of the image, the gray level co-occurrence matrix (GLCM) [5]. Co-occurrence probabilities provide a second-order method for characterizing texture features. These probabilities represent the
conditional probabilities of all pair-wise combinations of gray levels in the spatial window of interest given two parameters: the interpixel distance or lag (δ) and the orientation (θ). Hence, the probability measure can be defined as:

Pr(x) = { C_{ij}(δ, θ) }    (1)

where C_{ij} is the co-occurrence probability between gray levels i and j and is defined as:

C_{ij} = P_{ij} / Σ_{i,j=1}^{G} P_{ij}    (2)
where P_{ij} represents the number of occurrences of gray levels i and j within the image for a given pair (δ, θ), and G is the quantized number of gray levels. In order to reduce both the computing time and the noise of the signal, the value of G is typically 8 or 16. Statistical parameters can be calculated from the GLCM to describe various textural properties; for example, texture features include randomness, coarseness, linearity, periodicity, contrast and harmony. In this study, we mainly focused on the GLCM contrast, defined as:

Contrast = Σ_{i,j} C_{ij} (i − j)^2    (3)
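As a concrete illustration of eqs. (1)-(3), a minimal sketch that builds the co-occurrence counts for one (lag, orientation) offset and evaluates the contrast. It is our own NumPy rendering, not the authors' MATLAB routines; the quantization scheme is an assumption.

```python
import numpy as np

def glcm_contrast(img, dy=0, dx=1, levels=8):
    """GLCM contrast for one (distance, orientation) offset, following eqs. (1)-(3)."""
    # quantize the gray levels to G = `levels` bins (assumes img has a positive maximum)
    q = np.floor(img.astype(float) / img.max() * (levels - 1e-9)).astype(int)
    h, w = q.shape
    # count co-occurrences P_ij for the chosen offset (dy, dx)
    P = np.zeros((levels, levels))
    a = q[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    b = q[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)]
    np.add.at(P, (a.ravel(), b.ravel()), 1)
    C = P / P.sum()                                   # co-occurrence probabilities (eq. 2)
    i, j = np.indices(C.shape)
    return np.sum(C * (i - j) ** 2)                   # contrast (eq. 3)
```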
Perceptually, an image is said to have high contrast if areas of different intensity level are clearly visible. Hence, both the spatial frequency of change and the intensity difference between neighboring regions will affect the contrast.

2.2.3 Autocorrelation Function and Regularity Calculation

The regularity of a pattern is of fundamental importance in the characterization of some textures, such as those of textiles or wood. A patterned texture has two aspects: the spatial relationship between pixel intensities and the repeat distance of the repetitive units [6]. Several approaches have been developed to analyze the regularity of patterned structures. The GLCM has been applied to image retrieval and defect detection ([7], [8]). Image subtraction [9], Gabor filtering [10] and hash functions [11] were also investigated. Spatial domain filtering is also a common tool for texture analysis. In this work, we focus on the autocorrelation function calculated from the Fourier transform in order to extract the regularity, following the work of Chetverikov ([12], [13]). Chetverikov developed a regularity calculation algorithm: for a given periodic structure, it is possible to calculate a regularity between 0 and 1 (1 describing a perfectly regular structure while 0 indicates a random texture). This work also demonstrated a close relationship between the definition of regularity and human perception. For extensive explanations of the different parts of the algorithm, see Chetverikov's work ([12], [13]). Here we present only the main parts of the calculations. First, the power spectrum of the considered image is needed:

S(u, v) = |F(u, v)|^2 / M^2    (4)

where F is the Fourier transform of the image.
The power spectrum is related to the areal autocorrelation function (AACF) by the Wiener theorem, which states that the AACF and the power spectrum form a Fourier pair:

AACF(m, n) = IFFT[ FFT[I(m, n)] × FFT*[I(m, n)] ]    (5)

where IFFT is the inverse of the FFT (fast Fourier transform) and FFT* denotes its complex conjugate. To allow the calculation in the angular direction, the AACF is normalized and then converted into a polar representation. The regularity is composed of two contributions: the intensity regularity (R_int) and the position regularity (R_pos). R_pos represents the periodicity of the layout of the elements composing the pattern, while R_int indicates the regularity and stability of the intensity of the elements. For each angular direction, the normalized AACF is extracted. From the positions of the extrema and their intensities, R_int is calculated, while the distances between the various extrema allow for the quantification of R_pos. The regularity is then defined as:

Regularity = max{ (R_pos × R_int)^2 }    (6)
Figure 2 briefly summarizes the different stages of the calculation.
Fig. 2. The various steps of the calculation of the regularity. From the raw image (a), the Fourier spectrum is calculated (b). Then a polar and normalized AACF representation is calculated (c), from which the regularity position and intensity (d) are obtained by applying the Chetverikov algorithm.
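A compact sketch of the first stages of this computation (power spectrum and normalized AACF, Eqs. 4–5) is given below. Sampling the AACF along one direction stands in for the polar representation; the full extraction of Rpos and Rint from the extrema follows Chetverikov [12], [13] and is only indicated here. Function names are illustrative.

import numpy as np

def normalized_aacf(img):
    I = img.astype(float) - img.mean()   # remove the mean gray level
    F = np.fft.fft2(I)
    S = (np.abs(F) ** 2) / I.size        # power spectrum (Eq. 4)
    aacf = np.real(np.fft.ifft2(S))      # inverse FFT of the power spectrum (Eq. 5)
    aacf = np.fft.fftshift(aacf)         # put the zero-lag peak at the centre
    return aacf / aacf.max()             # normalization

def aacf_along_direction(aacf, theta, n=200):
    # Sample the normalized AACF along direction theta (one "slice" of the polar
    # representation); the extrema of this 1-D profile feed Rint and Rpos.
    cy, cx = aacf.shape[0] // 2, aacf.shape[1] // 2
    r = np.arange(n)
    ys = np.clip(np.round(cy + r * np.sin(theta)).astype(int), 0, aacf.shape[0] - 1)
    xs = np.clip(np.round(cx + r * np.cos(theta)).astype(int), 0, aacf.shape[1] - 1)
    return aacf[ys, xs]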
3 Texture Variation Analysis of Canvas Due to Cleaning
Texture analysis has a broad range of applications in the field of art conservation. In order to illustrate its potential and usefulness, we selected a particular case study: the cleaning of unpainted canvas. As a result of aging or damage, unpainted canvas can become stained or discolored. Several techniques exist to clean stained textiles such as unpainted canvas without using liquids, including soft and hard sponges, erasers, and eraser crumbs [14]. These dry treatments involve a certain amount of abrasion of the surface, which causes slight changes in the canvas texture. Two commercial cotton duck canvases were selected (referred to as canvas 1 and canvas 2). Their surfaces were scrubbed with fine sandpaper to simulate the cleaning process at two different stages: the first stage was a gentle scrubbing, while the second was stronger, reaching the point where the canvas threads were visibly damaged.

3.1 Effect on Regularity
A careful visual inspection of the canvas surfaces showed a decrease in the regularity of the weave pattern due to the scrubbing. The physical strain imposed by the scrubbing tended to disturb the arrangement of the fibers. As a result of the gentle scrubbing, the tension of the weave was loosened and we also noticed the emergence of slubs. With stronger scrubbing, defects in the structure, such as holes and snagged fibers, were observed. In order to quantify these variations in pattern regularity, we applied the algorithm described in Section 2.2.3 to the different samples. Figure 3 presents the results.
Fig. 3. Pattern regularity as a function of the cleaning stage for canvas 1 and canvas 2
For both sets of samples, the regularity calculated using the Chetverikov algorithm ([12], [13]) decreases with the intensity of the scrubbing. Before cleaning, canvas 1 possesses a higher regularity than canvas 2. This difference is the reason why gentle scrubbing created a greater decrease in the pattern regularity for canvas 1 than for canvas 2. Furthermore, both canvases see their regularity decrease almost linearly with more abrasion. Both trends are in accord with results observed by visual
inspection. Hence, it is possible to conclude that the proposed regularity calculation is suitable for characterizing slight variations in pattern regularity. The next step of this study is to explore the variation of texture strength and appearance of the canvas due to the cleaning.

3.2 Effect of Illumination and Cleaning Stage on Canvas Texture
Most real surfaces exhibit non-Lambertian behavior: the apparent brightness of the surface to an observer depends strongly on the position of the illumination (and on the position of the viewer). For a highly patterned surface such as a textile, the effect of the illumination on the perceived brightness and pattern must be examined in detail.

3.2.1 Effect on Brightness
The various sets of samples were imaged for the 40 positions of illumination. For each position, the average brightness was calculated, corresponding to the mean gray level of the image after conversion to grayscale. One way to examine the effect of cleaning is to make a polar 3D plot of the variation of luminance according to the position of illumination (see Figure 4).
Fig. 4. Variation of the luminance due to gentle scrubbing for canvas 1 according to the position of illumination
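A small sketch of how such a polar luminance map could be assembled, assuming the captured frames are available as grayscale arrays keyed by their (elevation, azimuth) illumination position; the data structure and plotting choices are illustrative, not the authors' setup.

import numpy as np
import matplotlib.pyplot as plt

def luminance_change_plot(before, after):
    # before/after: dicts mapping (elevation_deg, azimuth_deg) -> 2-D grayscale array
    positions = sorted(before)
    elev = [e for e, a in positions]
    azim = [a for e, a in positions]
    delta = [float(np.mean(after[p])) - float(np.mean(before[p])) for p in positions]
    ax = plt.subplot(projection='polar')
    sc = ax.scatter(np.radians(azim), elev, c=delta, cmap='gray')
    plt.colorbar(sc, label='Change in mean luminance (gray levels)')
    plt.show()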
The anisotropy of the structure is evident: the mean luminance depends on both the elevation and the azimuthal position of the source (the brightness of an isotropic structure would be invariant to the azimuthal position of the illumination). The lighter part of the graph corresponds to the positions of the illumination for which the scrubbing increased the brightness, while the darker part corresponds to a loss of luminance. Hence, we notice a general increase of the mean luminance at high elevation angles due to scrubbing, while the mean luminance under grazing-angle illumination decreases. This result has a simple origin. One of the obvious effects of scrubbing is to make the fibers fuzzier and more pilled. As a result, part of the weave structure that was empty becomes filled with fibrils, so a portion of the light that traveled through the structure of the untreated canvas is now reflected back to the camera. On the other hand, scrubbing diminishes the height of the weave structure, and the surface therefore becomes flatter, with fewer asperities to reflect light incident at grazing angles. The main limitation of this method is that it does not actually provide information about the variation of texture or perceived texture. To overcome this limitation, the effect of illumination on the GLCM was investigated.

3.2.2 Effect on GLCM Contrast
Figure 5 shows the variation of GLCM contrast as a function of the lateral offset for canvas 1 according to the elevation of the illumination.
Fig. 5. GLCM contrast of untreated canvas 1 for various elevation angles of the illumination
By computing the GLCM contrast as a function of the lateral displacement, the weave pattern is revealed. The offset is given in pixels rather than in millimeters or micrometers for the sake of simplicity, but the width of the undulations of the GLCM contrast corresponds precisely to the actual width of the threads. These results also demonstrate that illumination incident near grazing angle (larger elevation angle) tends to enhance the pattern of the canvas texture. For the elevation angle of 20°, very near normal, the undulation of the weave is barely noticeable. These results are again in accordance with the visual perception of the canvas surface. As with brightness, the
GLCM contrast is strongly dependent on the position of the light source. In this context, a full characterization of this contrast was needed to determine the best lighting geometry for measuring the texture change caused by cleaning. Figure 6 presents a 3D polar plot of the maximum GLCM contrast as a function of the illumination position.
Fig. 6. Maximum GLCM contrast on untreated canvas 2 according to the position of illumination
The lighter regions correspond to the positions of the light source producing the largest values of GLCM contrast. These regions are expected to be more sensitive to the effect of cleaning. This quantification of the effect of illumination allows for the choice of the optimum lighting geometry for detecting texture changes due to cleaning. The figure indicates that the highest values of contrast are obtained for a large elevation angle (75°) and an azimuthal angle of 0° or 180°. However, in order to avoid masking and shadowing effects, we selected an elevation angle slightly less than optimal (60°). With this lighting geometry chosen for the two canvases, Figure 7 presents the effect of the cleaning on the GLCM contrast for an elevation angle of 60° and an azimuthal position of 0°. The effect of the cleaning on the GLCM contrast is evident. For both canvas 1 and canvas 2, we noticed a sharp decrease of the contrast after gentle scrubbing; the further contrast decrease following stronger scrubbing was slight. In addition, the undulation characteristics of the pattern are attenuated by the cleaning. As a result of the pilling of the threads, the pattern is attenuated and the surface seems more
Fig. 7. Effect of cleaning on the GLCM contrast as a function of the offset for canvas 1 (top) and canvas 2 (bottom)
homogeneous and has lower contrast. In order to provide a single metric to characterize the contrast variation, which carries the information about the texture pattern, we may consider the maximum amplitude variation of the GLCM contrast (see Figure 8). This parameter is defined as the difference between the maximum and minimum of the GLCM contrast.
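A trivial sketch of this amplitude metric, assuming a contrast-versus-offset curve has already been computed (for instance with the GLCM sketch given earlier); the function name is illustrative.

import numpy as np

def glcm_contrast_amplitude(contrast_curve):
    # Difference between the maximum and minimum of the GLCM contrast
    # over the range of offsets (the single metric plotted in Fig. 8).
    c = np.asarray(contrast_curve, dtype=float)
    return float(c.max() - c.min())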
Fig. 8. Effect of cleaning on the GLCM contrast amplitude variation
Using the statistics obtained from the co-occurrence matrix of images seems promising for quantifying the slight changes in perceived texture that can result from cleaning.
4 Conclusion
This study focused on the extraction of textural properties of artworks from the analysis of their visual appearance. An experimental device was developed to study the effect of the illumination direction. As an example, we investigated the effect that surface modification induced by the cleaning of unpainted canvas has on visual perception. Three metrics were selected to characterize key features of the perceived canvas texture, namely the brightness, the regularity, and the contrast. Because cleaning increases the fuzziness of the fibers while decreasing the roughness of the surface, the observed brightness is modified: the mean luminances of the samples increase for illumination angles close to the normal, while the loss of roughness decreases diffuse reflection and leads to a decrease of the brightness for grazing-angle illumination. The regularity calculation showed a progressive decrease for the two canvases with a greater degree of cleaning. The contrast was evaluated using the GLCM contrast and computed for different offset distances. As a first step, we investigated the effect of the illumination direction on the contrast value, and the illumination position giving the maximum contrast was selected. From the images captured under these lighting conditions, it was possible to quantify the effect of the scrubbing on both canvases. It appeared that the contrast dropped dramatically after the gentle scrubbing and then did not decrease significantly with further scrubbing. By combining these three metrics, it is therefore possible to describe precisely the effect of the cleaning on the perception of these surfaces. The application of image analysis tools to art conservation science seems promising. The study of textural variation of particular materials such as stone, wood, or ceramic holds great potential for evaluating the effects of conservation treatments as well as for tracking the changes produced during the degradation of object surfaces.
References
1. Tuceryan, M., Jain, A.K.: In: Chen, C.H., Pau, L.F., Wang, P.S.P. (eds.) Handbook of Pattern Recognition and Computer Vision, pp. 207–248. World Scientific, Singapore (1998)
2. Jain, A.K., Duin, R.P.W., Mao, J.: Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 4–37 (2000)
3. Sonka, M., Hlavac, V., Boyle, R.: Image Processing, Analysis, and Machine Vision. Brooks/Cole, Pacific Grove (1998)
4. Motoyoshi, I., Nishida, S., Sharan, L., Adelson, E.H.: Image statistics and perception of surface qualities. Nature 447, 206–209 (2007)
5. Haralick, R.M.: Statistical and structural approaches to texture. Proc. IEEE 67, 786–804 (1979)
6. Ngan, H.Y.T., Pang, G.K.H., Yung, S.P., Ng, M.K.: Wavelet based methods on patterned fabric defect detection. Pattern Recognit. 38, 559–576 (2005)
7. Lin, H.C., Wang, L.L., Yang, S.N.: Regular-texture image retrieval based on texture-primitive extraction. Image Vis. Comput. 17, 51–63 (1999)
8. Kuo, C.F.J., Su, T.L.: Gray relational analysis for recognizing fabric defects. Textile Res. J. 73, 461–465 (2003)
9. Sandy, C., Norton-Wayne, L., Harwood, R.: The automated inspection of lace using machine vision. Mech. J. 5, 215–231 (1995)
10. Kumar, A., Pang, G.: Defect detection in textured materials using Gabor filters. IEEE Trans. Ind. Applicat. 38(2), 425–440 (2002)
11. Baykal, I.C., Muscedere, R., Jullien, G.A.: On the use of hash functions for defect detection in textures for in-camera web inspection systems. In: Proc. IEEE Int. Symp. Circuits Systems, ISCAS, vol. 5, pp. 665–668 (2002)
12. Chetverikov, D.: Pattern regularity as a visual key. Image Vis. Comput. 18, 975–985 (2000)
13. Chetverikov, D., Hanbury, A.: Finding defects in texture using regularity and local orientation. Pattern Recognit. 35, 2165–2180 (2002)
14. Esmay, F., Griffith, R.: An Investigation of Cleaning Methods for Untreated Wood. Postprints of the Wooden Artifacts Group of the American Institute for Conservation of Historic and Artistic Works, AIC, Washington, DC, pp. 56–64 (2004)
Visual Abstraction with Culture
Yang Cai1, David Kaufer1, Emily Hart1, and Yongmei Hu2
1 Carnegie Mellon University, USA
2 Guizhou University, China
Abstract. Visual abstraction enables us to survive in complex visual environments by augmenting critical features with minimal elements – words. In this chapter, we explore how culture and aesthetics impact visual abstraction. Based on everyday life experience and lab experiments, we found that the factors of culture, attention, purpose and aesthetics can help reduce visual communication to a minimal footprint. As we saw with the hollow effect, the more familiar we are with an object, the less information we need to describe it. The Image-Word Mapping Model we have discussed allows us to work toward a general framework of visual abstraction in two directions, images to words and words to images. In this chapter, we present a general framework along with some of the case studies we have undertaken within it. These studies involve exploration into multi-resolution, symbol-number, semantic differentiation, analogical, and cultural emblematization aspects of facial features.
1 Introduction
As ubiquitous computing enables sensor webs to become increasingly interconnected, the volume and complexity of information grows exponentially. Information overload becomes a constant problem. To address this overload, there has been an ever-increasing demand for intelligent systems to navigate databases, spot anomalies, and extract patterns from seemingly disconnected numbers, words, images and voices. Visual information is a growing contributor to this overload, calling for intelligent systems that can make principled visual abstractions over mountains of visual data. Visual abstraction can be defined as a minimal description of an object, its relationship to other objects, or its dynamics. Visual abstraction must often be guided by culture and aesthetics as well as human psychology. Written language is the ultimate visual abstraction of how human verbal communication mediates mind and culture. If a picture is worth 10,000 words [11], can a word be worth 10,000 images? The answer is yes. Many referential expressions are abstract but still convey visual information. It would take many millions of digital bytes to register the look of disbelief on a human face. It takes a writer only a single referential phrase, “raised eyebrows,” to capture this look through a linguistic abstraction. In our everyday life, we detect, recognize and retrieve images instantly with words. The linguistic retrieval of images dramatically reduces the representational overhead of the communicative transaction. Sometimes simple pictorial similarities between single alphabetic letters (e.g., “T” or “X”) and complex visual images (e.g., a traffic intersection) supply effective visual abstractions as well. The Roman rhetoricians circa 90 BC were the first to formalize
how mnemonic word patterns could serve as prompts to organize one’s memory and retrieve abstractions about prior situations that called for wisdom. They understood how seeding one’s memory with image-rich proverbs, maxims, adages, commonplaces, tropes and figures gives the orator a quick advantage in classifying previously unclassified situations, characters and norms. For better or worse, there are no slimmed-down algorithms to guide us toward optimal visual abstraction. As mentioned above, visual abstraction must often be guided by considerations of culture and aesthetics, matters far beyond the purview of traditional computer science. By describing a traffic intersection with a letter ‘T’ or ‘X’, we compress a dense image (e.g., 1 megabyte) to a sparse letter (e.g., 1 byte). The plan for the rest of this chapter is to examine in greater detail how context and culture affect visual abstraction, how visual abstraction can be encoded, and the role that technology can play in facilitating visual abstraction, such as face recognition and search over a massive video database.
2 The ‘Hollow Effect’
Most of the information in our daily life is redundant. Studies [20] show that photos normally provide much more information than we need; this redundancy can be as high as ninety percent [15]. In facial communication, dramatic reductions in the spatial resolution of images can be tolerated by viewers [3]. From the point of view of psychology and economics, the level of detail in data communication can be greatly reduced. For example, photos in newspapers normally have only two values for each dot (pixel): with or without ink. With grid screen processing, the size of the smallest pixels is increased so that the number of dots per area can be reduced greatly; however, the picture is still recognizable. Increasing the level of detail of the grid screen can make the image more attractive, but not more recognizable or comprehensible.
Fig. 1. We can often recognize everyday objects by their contours
The Russian psychologist Yarbus [28] used an eye tracking system to study the gazing path of the human visual search process. One of his significant discoveries was that human visual searching is guided by a context or purpose. Humans selectively look at things that interest them. Furthermore, humans anticipate things that are
familiar. Fig. 1 illustrates examples of daily objects that can be identified easily from their contours. The capacity to recognize these objects instantly depends on the level of experience the viewer has with particular objects. We call this information reduction phenomenon the ‘hollow effect’: the more we know about the context of the object, the less information we need to recognize it.

We conducted experiments to study the robustness of the hollow effect under varying conditions. We designed a lab experiment to understand: 1) the average minimal number of pixels of various images (face, indoors, outdoors, etc.) that subjects require for basic recognition; 2) the effect that guided questions can play in reducing the minimal number of pixels subjects require for face recognition; 3) the effect of age on recognition; and 4) the differences in recognition using black and white or color images. Ten unique images in both color and black and white formats were randomly chosen to cover four picture categories: (1) faces, (2) indoor scenes, (3) outdoor scenes, and (4) complex images, such as oil paintings. These images were also randomly ordered and presented individually at timed intervals by a simple computer program.

We hypothesized that, given a set of randomly selected images, those containing human faces would be recognized at smaller resolutions, followed by simple, commonly known objects, and then by more complex indoor and outdoor scenes. Regarding facial recognition, we hypothesized that simple recognition of a generic face would require the least resolution, while gender identification and recognition of a well-known individual (i.e., former President Bill Clinton) would require more pixels. We further hypothesized that the subject’s age would have no effect on required image size, and that an image being in black and white or color would make a negligible difference, though with a slight advantage toward color images.

Our initial prompt to subjects was simple. We asked subjects, “What is this?” after presenting each of the photos of face-only, indoor, outdoor, figure, and complex scenes (oil paintings). Subjects adjusted the size of the images until they could recognize the image. Facial recognition required significantly fewer pixels than human figures, indoor scenes, and outdoor scenes. As expected, complicated scenes required the largest number of pixels for identification. Finding that generic faces are recognized quickly, we next tested a set of photos of faces. We asked subjects three questions: “Who’s this?” “What is this?” and “Male or female?” respectively. The results show that answering the question “Who’s this?” required the fewest pixels (17 x 17 pixels). Subjects needed more resolution (32 x 32 pixels) to answer “What is this?” They needed even more pixels to identify gender (35 x 35 pixels). To some extent, the number of pixels associated with answers to each of these questions reflects the difficulty of the cognitive task. We found that face recognition needs far fewer pixels than we originally thought, especially if the subject knows what he or she is looking for. These results are consistent with previous findings that show human visual attention fluctuates according to the purposes of seeing, and that demand for visual information resources varies from task to task (Zabrodsky, 1997). When we think about our faces, we realize they are well-structured compared to other objects.
Humans are also well-experienced at identifying and discriminating between faces, which can make generic recognition of faces straightforward. Nonetheless, there are questions about face recognition that remain unsettled, such as whether humans have special “wired connections” to recognize a face.
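A minimal sketch of how stimuli of decreasing resolution can be generated for this kind of experiment, using nearest-neighbour resampling; the file name and the choice of the Pillow library are illustrative assumptions, not the authors' original apparatus.

from PIL import Image

def pixelate(path, n):
    # Render an image at an n x n pixel resolution and blow it back up
    # as visible blocks, mimicking the stimuli whose size subjects adjusted.
    img = Image.open(path).convert('L')           # grayscale, as in the B/W condition
    small = img.resize((n, n), Image.NEAREST)     # downsample to n x n pixels
    return small.resize(img.size, Image.NEAREST)  # upsample for display

# e.g., show a face at roughly the 17 x 17 resolution reported for “Who’s this?”
# pixelate('face.jpg', 17).show()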
Although our experiments showed that image recognition can be task and context dependent, they also established a general ordering of the complexity with which objects are recognized. The order, ranging from fewer to more pixels, is “faces”, “outdoors”, “figure”, “indoors”, and “complex scenes.” Complex scenes, such as impressionist oil paintings, contain more vague objects that confuse viewers and can make recognition challenging even at high pixel resolutions. Our experiments also revealed that a “pixel” can serve as a legitimate numerical measurement of visual information processing. This is a controversial finding: for decades, cognitive scientists have relied on other measures, such as reaction time, number of entities, and error rate, to quantify visual information processing. However, we found that a pixel is a simple way to capture and compute visual processing within a normal human–computer interaction environment. It is simple but not without problems. Because of redundant pixels from extraneous sources, the pixels of an image may not always provide an accurate measure of visual information. To eliminate this problem, we found we had to preprocess the images to remove extraneous sources of pixel redundancy. For example, for face recognition tasks, we cut off the background outside the face outline. We also used square images to simplify the measurement. Curiously, subjects needed slightly fewer pixels to recognize things in black and white images than in color images; however, those differences were not statistically significant.
3 Image-Word Mapping Model
Cognitive scientists have developed models to uncover the relationship between words and images. CaMeRa [21], for example, is a computational model of multiple representations, including imagery, numbers and words. However, its mapping between words and images is linear and singular, which lacks flexibility. An artificial neural network model, which allows this greater flexibility, has been proposed to understand images as complex as oil paintings [19]; Solso remarks that the hidden layers of the neural network enable us to map words and visual features more effectively and efficiently. The claim is that, through hidden layers, we need fewer neurons to represent more images. However, what is in the hidden layers of the neural network remains shrouded in mystery. Images consist of two- or three-dimensional structures, in contrast to language’s one-dimensional construction; thus, the mapping between words and images is a challenging task. In order for this one-dimensional construction to work in tandem with the visual field, it must maintain the ability to go beyond one-dimensionality. Arnheim asserts that, through abstraction, language categorizes objects. Yet it is language that permits humans to see beyond mere shape [2]. This is not because language moves us closer to the physics of the world, but because language accommodates human interests and purposes by categorizing the physical world according to human affordances. Pinker [16] has explored these human affordance features of language in great depth. He notes that English conceives of water lines as a 2D boundary (over water, underwater) within a 3D space in part because the 2D difference matters to human survival. Surfaces can be presented as 1D, 2D, or 3D, depending on how we want to focus on motion relative to them. When we want to emphasize trajectory and direction without texture or resistance, we can
describe the surface as one-dimensional, with a 1D referent, like path. We can say, “he walked along a path through the field” to indicate a linear trajectory. The preposition “along” cues a 1D trajectory. Should we want to emphasize the 2D texture of the field, we can reference the field directly with the preposition “across” – “she walked across the field.” If we want to emphasize resistance through a 3D space, we can say, “he walked through the field,” suggesting some stepping, some stomping, and some resistant brush. A man described as walking on or over the snow is understood to have an easier walk than one described as walking through the snow. As Pinker [16] characterizes it, the tense system is designed to help humans segment episodic time between that which is knowable and experienced (the past), that which is unchartered (the future), and that which is emergent and unclassified (the present). He shows that the language of causation is associated with what is foreseeable, over which one takes responsibility. Pinker holds that language provides affordable access to the needs of human categorization because it evolved that way. A language apparatus that could serve human classification needs and interests would, according to Pinker, have evolutionary advantages over language systems that could not. Pinker’s observations complement our own project, which is to study how far we can get by viewing language as a computational device for abstracting key features of non-linguistic types of information. By virtue of language, humans are inherently trained to go beyond object shape and explore further textures, dimensions, and sub-shapes; these further explorations seem to be the only method we have to satisfactorily describe a human subject. Along these lines, Roy developed a computerized system known as Describer that learns to generate contextualized spoken descriptions of objects in visual scenes [17]. Roy’s work illustrates how a description database can be useful when paired with images in constructing a composite image. Roy’s findings suggest that significant variation in the language used to describe an image results in significant variations in the images retrieved. Such variation can be reduced by organizing more constrained words and phrasal patterns into the computer as input. Roy’s work nicely illustrates how words are abstractions of images and images are extensions of words.
4 Descriptions for Humans
In our project, we have focused on various aspects of the mapping between words and images for human features. For the rest of this chapter, we describe these studies. We were intrigued by the rich mapping between words and faces both because of the cultural importance of faces and because facial descriptions are encoded in the literatures of the world. So comprehensively have faces been described in world literatures that language references have been compiled to record the various ways in which human faces can be rendered. For example, the Description Dictionary is a collection of descriptive words and phrases about human features from literatures around the world.

4.1 Multiple Resolution Descriptions
Human descriptions are classifiers for shape, color, texture, proportion, size and dynamics in multiple resolutions. For example, one may start to describe a person’s
figure, then hairstyle, face, eyes, nose, and mouth. Human feature descriptions have a common hierarchic structure: for example, figure, head, face, and eyes. Like a painting, a verbal description can be built up in multiple resolutions: the words may start with a coarse description and then ‘zoom’ into finer-grained sub-components. We have collected over 100 entries of multi-resolution descriptions from literature. Due to space limitations, we list only a few samples, where the underlined sections represent the global levels of description, the bolded sections show the component-based descriptions, and the italicized sections are the details:
• “For a lean face, pitted and scarred, very thick black eyebrows and carbon-black eyes with deep grainy circles of black under them. A heavy five o’clock shadow. But the skin under all was pale and unhealthy-looking [6]”.
• “Otto has a face like very ripe peach. His hair is fair and thick, growing low on his forehead. He has small sparkling eyes, full of naughtiness, and a wide, disarming grin which is too innocent to be true. When he grins, two large dimples appear in his peach blossom cheeks [30]”.
• “Webb is the oldest man of their regular foursome, fifty and then some- a lean thoughtful gentleman in roofing and siding contracting and supply with a calming gravel voice, his long face broken into longitudinal strips by creases and his hazel eyes almost lost under an amber tangle of eyebrows [23]”.
4.2 Semantic Differential Representation
The Semantic Differential method measures perceptual and cognitive states in numbers or words. For example, the feeling of pain can be expressed with adjectives ranging from weakest to strongest. Figure 2 shows a chart of visual, numerical and verbal expressions of pain used in hospitals: No Hurt (0), Hurts Little Bit (2), Hurts Little More (4), Hurts Even More (6), Hurts Whole Lot (8) and Hurts Worst (10).
Fig. 2. Expressions of pain in pictures, numbers and words
The physical feeling can be quantified with mathematical models. When the change of a stimulus (I) is very small, we do not detect the change. The minimal difference (ΔI) at which the sensation is just noticeable is called the perceptual threshold, and it depends on the initial stimulus strength I. Over a broad range, the normalized perceptual threshold is a constant, ΔI/I = K. This is Weber’s Law. Given the perceptual strength E, as the stimulus I changes by ΔI, the change of E is ΔE, and we have the relationship ΔE = K*ΔI/I. Letting ΔI become dI and ΔE become dE and integrating, we obtain Weber-Fechner’s Law:
E = K * ln(I) + C    (1)
where C is a constant, K is the Weber ratio, I is the stimulus strength and E is the perceptual strength. Weber-Fechner’s Law states that the relationship between our perceptual strength and the stimulus strength is a logarithmic function. People remember cartoon-like figures better because cartoons show exaggerated features; this result appears to be an application of Weber’s Law running instinctively in our visual memory.

4.3 Symbol-Number Descriptions
In many cases, numbers can be added to give more granularity. For example, the FBI’s Facial Identification Handbook [9] comes with a class name, such as bulging eyes, and then a number to identify specific levels and types. The FBI has created a manual for witnesses, victims, or other observers of a suspect to use in identifying possible suspect features. The catalog presents several images per page under a category such as “bulging eyes”; each image in such a category has bulging eyes as a feature, and the respondent is asked to identify which image has bulging eyes that most closely resemble the suspect’s. This book is an extremely efficient and effective tool for both forensic sketch artists and police detectives. It is most commonly used to help a witness or victim convey the features of the suspect to the sketch artist in order to render an accurate composite sketch.

4.4 Analogical Descriptions
Analogy is a coarse descriptor. Instead of describing features directly, people often refer to a feature through a stereotype, for example, a movie star’s face. The analogical mapping includes structural mapping (e.g., face to face) or component mapping (e.g., Lincoln’s ear and Washington’s nose). Children often use familiar things to describe a person, for example, using ‘cookie’ to describe a round face.
Fig. 3. Analogical description of noses and face type in Chinese
Analogies are culture-based. In the Western world, nose stereotypes are named after historical figures, and many analogies are also drawn from animal noses or plants. Fig. 3 illustrates examples of the nose profiles described above; we use simple line drawings to render the visual presentation. Analogy is triggered by experience, which involves not only images but also dynamics. The third sketch in Fig. 3 shows a ‘volcano nose’, which triggers readers’ physical experiences, such as pain, eruption, and
explosion. In this case, readers not only experience these senses, but also predict the consequences; the analogy thus points to a physical process that remains below the visible surface. Given a verbal description of a nose, how do we visually reconstruct the nose profile with minimal elements? In this study, we use a set of 5 to 9 ‘control points’ to draw a profile. By adjusting the relative positions of the control points, we can reconstruct many stereotypes of nose profiles, and many variations in between. To smooth the profile contour, we apply a spline curve-fitting model [26]. The cultural basis of analogy leads people to improvise analogies from daily objects, and different cultures build analogies from objects that are salient to them. For example, Chinese speakers use an ‘onion’ shape to describe short snub noses and a ‘sunflower seed’ shape to describe slim faces. The Chinese language evolved from a pictorial language, and today many Chinese characters are still used to describe people’s face shapes. Based on similarities between face shapes and Chinese character shapes, face shapes are divided into eight types: tian (field), you (due), gui (country), yong (use), mu (eye), jia (jia), feng (wind), shen (apply). The last three sketches in Fig. 3 show the characters and the corresponding shapes [14]. All languages are symbols, and understanding the cultural influences on analogical descriptions enables us to unleash the power of languages for visual abstraction. It took hundreds of years for the world to adopt Arabic numerical symbols; with cross-continental telegraphy, it took days for the world to accept Winston Churchill’s V-sign. Networked computing is increasingly helping people share analogical descriptions across cultural barriers. The unique descriptions from Chinese or Indian cultures can conceivably be used in other parts of the world, in unexpected applications such as video search engines and archeological reconstruction.
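Returning to the control-point profiles mentioned above, the following is a minimal sketch of fitting a smooth contour through a handful of points with a parametric spline; the coordinates are invented for illustration and are not the stereotype point sets used in the study.

import numpy as np
from scipy.interpolate import splprep, splev

# Hypothetical (x, y) control points for a nose profile, ordered from bridge to tip.
control_x = [0.00, 0.08, 0.12, 0.30, 0.22, 0.05]
control_y = [1.00, 0.70, 0.45, 0.15, 0.05, 0.00]

# Fit a parametric spline through the control points and sample a smooth contour.
tck, _ = splprep([control_x, control_y], s=0)
u = np.linspace(0, 1, 200)
profile_x, profile_y = splev(u, tck)
# Moving individual control points reshapes the profile toward other stereotypes.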
5 Decontextualization, Recontextualization and Emblemization
Facial images can reveal much about psychological states but they do so imperfectly and with much error. In order to over-determine the accuracy of what facial images mean for the description of mood, the media often relies on a process called decontextualization. Decontextualization means to take an image out of context and use that image to reinforce verbal interpretations of a person’s internal state in contexts where there is less direct support. The coverage of Hillary Clinton provides an excellent example of this media phenomenon. From August 1998 through June 2000, NBC news ran ten broadcasts of facial images of Hillary Clinton from a memorial service honoring American victims of terrorist bombings where her face is shown in isolated profile with a shiny eye that looks tear-stained. For nine of these broadcasts, NBC never mentioned that the images came from a memorial service and instead “recontextualized” the image to offer support for their verbal commentary on two unrelated matters: the “strained” state of her marriage with Bill Clinton and the “pressure” she was under to decide whether or not to run for the Senate in 2000. During the release of the Starr Report hearings, NBC consistently used the image in profile to emblematize Hillary Clinton’s personal anguish as a betrayed spouse. During her Senate run, NBC hauled out the same image in profile to emblematize the pressure she was under to decide on a Senate run.
Such uses of de-contextualization, re-contextualization and emblematization to create visual “mood bites” are ubiquitous in the media and add weight to undersubstantiated verbal descriptions of psychological states. Yet why “mood bites” work and the ethical boundaries between fair representation and distortion are still poorly understood. Based on the facial imaging environments we have created for other tasks, we will create a simulated user-controlled “news broadcast” that allows a user to manipulate the following two parameters along a continuous scale: the visibility of decontextualization (if de-contextualization becomes too visible, the image will seem “staged” and lose contextual credibility); and the exact placement in the verbal voiceover of where the image is shown. The idea is that, the longer the image is shown across multiple clauses of a verbal voiceover, the more the image is used to capture a general mood rather than a specific point. By way of contrast, when the image is displayed just over a specific clause and then removed, it functions more definitely and emblematically as evidence for a definite point expressed in the clause. By letting users manipulate these parameters on a stock news story, then fill out a survey covering the “information value” and “fairness” of the visual display, we believe we can make significant progress in understanding the role and excesses of the de-contextualized visual in media.
6 Interactive Facial Reconstruction
We developed a prototype of an interactive system for facial reconstruction on a computer. In the system, a user selects feature keywords in a hierarchical structure. The computer responds to the selected keyword with a pool of candidates that are coded with labels and numbers. Once a candidate is selected, the computer superimposes the components and reconstructs the face. A composite sketch of a suspect is normally done by professionals; our system enables inexperienced users to reconstruct a face through a menu-driven interaction. Moreover, the reconstruction process is reversible, so it can be used for facial description studies, robotic vision and professional training.
Fig. 4. Interactive front facial reconstruction based on image components
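A minimal sketch of the superposition step for such a menu-driven tool; the component library, file names and the use of transparent overlays are hypothetical illustrations, not the actual prototype.

from PIL import Image

# Hypothetical library: each keyword choice maps to a transparent PNG overlay
# aligned to a common face template of identical size.
selection = {
    'face_shape': 'shapes/oval.png',
    'eyes': 'eyes/bulging_03.png',
    'nose': 'noses/roman_02.png',
    'mouth': 'mouths/thin_01.png',
}

def reconstruct_face(selection):
    # Superimpose the selected component images into a composite face.
    layers = [Image.open(p).convert('RGBA') for p in selection.values()]
    face = layers[0]
    for layer in layers[1:]:
        face = Image.alpha_composite(face, layer)  # same-size RGBA layers
    return face

# reconstruct_face(selection).show()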
7 Conclusions
In this chapter, we have explored the impacts of culture and aesthetics on visual abstraction. Based on everyday life experiences and lab experiments, we found that the factors of culture, attention, purpose, and aesthetics can help reduce visual communication to a minimal footprint. As we saw with the hollow effect, the more familiar we are with an object, the less information we need to describe it. The Image-Word Mapping Model we have discussed allows us to work toward a general framework of visual abstraction in two directions, images to words and words to images. In this chapter, we have simply overviewed that general framework and presented some of the small studies we have undertaken within it. These studies involve exploration into multi-resolution, symbol-number, semantic differentiation, analogical and cultural emblematization aspects of facial features. None of the studies we conducted could have been started and completed solely through algorithms: matters of context and culture reared their head at every turn. Despite some substantial progress, this research remains at an early phase. We have successfully identified many puzzle pieces teaching us how visual abstraction works through language and other mediators. But we have yet to find all the pieces we need even to understand the larger puzzle frame. We need the puzzle frame to understand how best to fit together the pieces we have identified and to learn about the pieces still hidden from view.
Acknowledgement
We are also indebted to Emily Durbin and Brian Zeleznik for their assistance.
References
1. Allport, A.: Visual Attention. MIT Press, Cambridge (1993)
2. Arnheim, R.: Visual Thinking. University of California Press, Berkeley (1969)
3. Bruce, V.: The Role of the Face in Communication: Implications for video phone design. Interaction with Computers 8(2), 166–176 (1996)
4. Cai, Y.: How Many Pixels Do We Need to See Things? In: Sloot, P.M.A., et al. (eds.) ICCS 2003. LNCS, vol. 2657. Springer, Heidelberg (2003)
5. Chen, J.L., Stockman, G.C., Rao, K.: Recovering and Tracking Pose of Curved 3D Objects from 2D Images. In: Proceedings of IEEE Comput. Vis. Patt. Rec. Publisher, New York (1993)
6. Doctorow, E.L.: Loon Lake. Random House, New York (1980)
7. Duchowski, A.T., et al.: Gaze-Contingent Displays: A Review. Cyber-Psychology and Behavior (2004)
8. Fan, T.J., Medioni, G., Nevatia, R.: Recognizing 3-D Objects Using Surface Descriptions. IEEE Trans. Patt. Anal. Mach. Intell. 11, 1140–1157 (1989)
9. FBI Facial Identification Catalog (November 1988)
10. Larkin, J.H., Simon, H.A.: Why a Diagram Is (Sometimes) Worth 10,000 Words. Cognitive Science 11, 65–100 (1987)
11. Lowe, D.G.: The Viewpoint Consistency Constraint. Int. J. Comput. Vision 1(1), 57–72 (1987)
12. Luo, J., et al.: Pictures Are not Taken in a Vacuum. IEEE Signal Processing Magazine (March 2006)
13. Majaranta, P., Raiha, K.J.: Twenty Years of Eye Typing: Systems and Design Issues. In: Eye Tracking Research and Applications (ETRA) Symposium. ACM, New Orleans (2002)
14. Mou, F., Li, Z.: Modern Surgery and Techniques. New Time Press, Beijing (2003) ISBN 7-5042-0851-5
15. Petersik, J.T.: The Detection of Stimuli Rotating in Depth Amid Linear Motion and Rotation Distractors. Vision Research 36(15), 2271–2281 (1996)
16. Pinker, S.: The Stuff of Thought: Language as a Window into Human Nature. Viking Press, New York (2007)
17. Roy, D.: Learning from Sights and Sounds: A Computational Model. Dissertation. Media Arts and Sciences. MIT, Cambridge (1999)
18. Silberberg, T.M., Davis, L.S., Harwood, D.A.: An Iterative Hough Procedure for Three-Dimensional Object Recognition. Pattern Recognition 17(6), 621–629 (1984)
19. Solso, R.L.: Cognition and the Visual Arts. MIT Press, Cambridge (1993)
20. Stroebel, L.D., et al.: Visual Concepts for Photographers. Focal Press, New York (1980)
21. Tabachneck-Schijf, H.J.M., Leonardo, A.M., Simon, H.A.: CaMeRa: A Computational Model of Multiple Representations. Cognitive Science 21, 305–350 (1997)
22. Ullman, S., Basri, R.: Recognition by Linear Combinations of Models. IEEE Trans. Patt. Anal. Mach. Intell. 13, 992–1006 (1991)
23. Updike, J.: Rabbit is Rich. Ballantine Books, New York (1996)
24. Viola, P., Jones, M.: Rapid Object Detection Using a Boosted Cascade of Simple Features. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, December 8-14, vol. 1, pp. 511–518. IEEE Computer Society Press, New York (2001)
25. Chiariglione, http://www.chiariglione.org/MPEG/standards/mpeg-7/mpeg-7.htm
26. Wikipedia (2007), http://en.wikipedia.org/wiki/Spline_mathematics
27. Web Exhibits, http://www.webexhibits.org/colorart/ag.html
28. Yarbus, A.L.: Eye Movements during Perception of Complex Objects. Plenum, New York (1967)
29. Zabrodsky, R., Peleg, S., Avnir, D.: Symmetry as a Continuous Feature. IEEE Trans. Pattern Analysis and Machine Intelligence 17(12), 1154–1166 (1995)
30. Isherwood, C.: Berlin Stories. Random House (1952)
Genre and Instinct
Yongmei Hu1, David Kaufer2, and Suguru Ishizaki2
1 College of Foreign Languages, Guizhou University
2 Department of English, Carnegie Mellon University
Abstract. A dominant trend relates written genres (e.g., narrative, information, description, argument) to cultural situations. To learn a genre is to learn the cultural situations that support it. This dominant thinking overlooks aspects of genre based in lexical clusters that appear instinctual and cross-cultural. In this chapter, we present a theory of lexical clusters associated with critical communication instincts. We show how these instincts aggregate to support a substrate of English genres. To test the cross-cultural validity of these clusters, we gave three English-language genre assignments to Chinese students in rural China, with limited exposure to native English and native English cultural situations. Despite their limited exposure to English genres, students were able to write English papers that exploited the different clusters in ways consistent with native writers of English. We conclude that lexical knowledge supporting communication instincts plays a vital role in genre development.
1 Introduction
This chapter examines the learning of written genres (e.g., narrative, information, description, argument) as partly an instinctual process, in the manner of Chomsky’s innateness hypothesis for language learning. Pinker [1], defending that hypothesis, challenged various assumptions about language education and the learning of writing in school, particularly at the sentence level. Pinker argued that descriptive grammar was innate and “instinctual” and developed independently of experience, schooling, or even learning. Descriptive grammar, according to Pinker, provides a native substrate for the more superficial and culturally dependent prescriptive grammar taught and learned in school. Much prescriptive grammar, Pinker argued, is empirically suspect, and, even when not, at best marginally improves on a writer’s innate skill rather than embodies or essentializes that skill. This chapter takes up the question of whether the notion of written genre might itself be based on universal “instincts” behind the formulation of “lexical clusters” and aggregations of these clusters to construct “reader experience,” or whether genre depends on rule-based learning subject to cultural variation. Stating the matter as an all-or-nothing “either-or” between instinct and rule is almost certainly, in our view, too strong. Pinker’s (and Chomsky’s) innateness hypothesis remains controversial and the subject of much discussion [2]. In addition, the notion of “instinct” is itself ambiguous, ranging between a strict notion of “closed instinct,” which polarizes instinct and experience, and a notion of “open instinct,” which fashions instinct and learned experience into permeable concepts, allowing that
instincts are not learned but better honed through learning and experience [3]. We adopt the latter position that written genres are largely based on open instincts about lexical clusters and the way they co-occur to produce reader experience. This means that we acknowledge that knowledge of written genres develops in part along with writing experience and an increasingly refined knowledge of the cultural situations that genres support. But it also means that the teaching and learning of genres from a cultural perspective can improve the writer’s performance, at best, only at the margins. The dominant North American pedagogies that cover genre emphasize the cultural determinants of genre, with genre variation largely explained as a manifestation of variation in cultural situations [4-7]. This cultural emphasis toward genre is well-motivated and, for the most part, well-executed. However, it has overlooked its own limitations by ignoring or leaving understudied aspects of genre knowledge that work instinctually and remain independent of one’s cultural positioning or experience. In this chapter, we review what the cultural-based pedagogies ignore. We also show what the cultural-based pedagogies would not predict, namely that second language writers with limited exposure to cultural situations in English show key rudiments of genre knowledge by virtue of the basic English fluency they claim. The driving engine of genre performance, in our view, is the deployment of open instincts about lexical clusters in language and the way these clusters aggregate to produce reader experience. These lexical clusters and their aggregation potential are implicit and never formally learned nor taught, either to first-language or second-language writers. Yet, these clusters and aggregation mechanisms seem available to first or second language writers as they acquire basic fluency in a language. In the next section, we describe the work we have done to identify these instinctual clusters. In the section following that, we describe some basic aggregation rules that help define reader experience across various genres. In the section after that, we show how Chinese students are able to reproduce some major genre differentiations, despite being asked to do so in an impromptu setting, with no prior notification or instruction, with no formal schooling in English language genres and only limited exposure to the cultural situations of the native English speaker.
2 Clusters of Lexical Experience Constituting Genre
In this section, we briefly run through seventeen major clusters of lexical experience constituting genre. A monograph-length treatment of these clusters in slightly different form is available [8].

2.1 First Person Reference/Attribution
This cluster allows for self-reference and self-attribution through the use of first person singular pronouns (e.g., I, me, my, myself). This cluster permits authors or characters to express their own involvement in texts.
2.2 Third Person Reference/Attribution
This cluster shifts the perspective to the third person through pronouns like “he” and “she” with various predicates indicating the involvement of third parties. This cluster stresses pronouns more than adjective + noun combinations (e.g., the old man), because the latter may be a one-shot reference, whereas third-person attribution becomes interesting for whole-text experience when the third person reference persists [9] to create a sense of narrative experience about a character. A succession of co-referential third person pronouns is correlated with narrative experience. Moreover, narrative experience is much more a trademark of fiction than of non-fiction texts. If one does a statistical tabulation of third person pronouns (e.g., he, she) across the Brown Corpus, as we did, one will find a significantly higher proportion in fiction than in non-fiction entries.

2.3 First Person Private Register
This cluster of words creates the experience of a “private” register, a first person narrator giving voice to thoughts that are not normally publicly known or circulated. This cluster is created by aligning words of self-reference with private cognition verbs indicating personal thought or feeling (e.g., I think, I feel, I believe) or reluctance (e.g., I regret that, I am sorry that, I'm afraid that). An autobiographical effect is created with the use of the aspectual “have” or “used to” (e.g., I have, I used to).

2.4 Generalized Private Register
This cluster deepens private experience from the first or third person perspective and can be conveyed through free-floating cognitive verbs (e.g., believe, feel, conjecture, speculate, pray for, hallucinate); verbs and adverbs expressing subjectivity (e.g., seems, seemingly, tentative, one way to think about); subjective time (e.g., seems like only yesterday, in a flash); and markers of epistemic stance such as confidence (e.g., completely true, assuredly so) and uncertainty (e.g., maybe, perhaps). This cluster is also activated through the use of confessional verbs (e.g., confess, acknowledge that, admit, let on that, let it slip that), adjectives and adverbs expressing intensity (e.g., very, fabulously, really, torrid, intensely, amazingly), and temporal adverbs expressing immediacy (e.g., right now, now, just then).

2.5 Narrative
This cluster marks the development of a story unfolding one event after another in time. It becomes active through the use of verbs that convey witnessed action (e.g., came, saw, conquered), temporal clauses (when she was young), temporal verb phrases (would never again), temporal prepositional phrases (in her youth) denoting biographical time and time intervals (for two years, over the last month), temporal adverbs indicating time shifts (e.g., next week, next month), and time/date information (e.g., on June 5, 2000).
2.6 Interactivity
This cluster indicates interactivity between the writer and reader or between characters in a text. It is activated by a multitude of English subclusters, such as curiosity raising, stimulating another mind to formulate questions arising along a common path of inquiry (e.g., what then shall we make of?); requests (e.g., request, ask of); direct address, the mental equivalent of hailing or summoning another’s attention (e.g., let us, I advise you, I urge, I urge you to, I recommend, I favor) or making implicit acknowledgement of an interlocutor (e.g., you, trust me, okay now); questions (Wh-questions – what do you think? how may I help?); quantitatively oriented survey questions (e.g., how often do you; with what frequency do you?) and conventional follow-ups (e.g., in response to your latest memo, per your last message). This cluster also involves grounded feedback of various sorts (e.g., okay, good, yes, go on, alright).

2.7 Reasoning
This cluster denotes paths of inquiry that the writer and reader share and traverse together under the writer’s direction. Subclusters consist of constructive reasoning that builds paths either in a forward direction (premise-conclusion) (e.g., thus, therefore) or a backward direction (conclusion-premise) (e.g., because, owing to the fact, on the grounds that, for the reason that). Constructive reasoning also includes initiating reasoning to launch an inquiry (e.g., suppose that, imagine that) and legitimating the link between premise and conclusion (e.g., as evidence for, in support of). A second subcluster is reasoning under contingency (if…then, it depends). A third is oppositional reasoning, which is used not to construct the reasoning of oneself but to acknowledge or block the reasoning of another. Oppositional reasoning itself consists of subclusters, such as denials/disclaimers (e.g., not the case, not do it, am not a crook), concessives (e.g., although, even if, it must be acknowledged), and resistance, which evokes the tension between competing ideas, events, forces, or groups (e.g., resistant, veto, counterargument, filibuster, went into battle against).

2.8 Emotional Valence
This cluster indicates emotionally tinged language, expressing positive (e.g., wonderful, marvelous, great, fantastic) and negative emotion (e.g., mistreatment, criminal, despised). Our investigations suggest that negative emotion is more elaborated in English than positive emotion and that negative emotions can be subcategorized into large subclusters of anger (e.g., irritable, rancorous, indignant), fear (afraid, scared, worried), and sadness (inconsolable, distraught, sorrow, depressed).

2.9 Interpersonal Relations
This cluster structures interpersonal relationships in texts, again between author and reader or between characters in a text. The relationships can be primarily positive, such as promising (e.g., I promise), reassuring (e.g., don't worry about), reinforcing (e.g., good job), acknowledging (e.g., thank you), and inclusive (e.g., let us all work as a team). There are further negative relationships, primarily threats and confrontations (e.g., What were you thinking of? I'll kick your butt).
2.10 Past Orientation

This cluster is active when the writing references a time prior to the time the writing was created. The narrative cluster (1.5 above) also references time past. However, this cluster contains structures of English that convey past orientation that are not on the main event path of a narrative flow of events, but that can elaborate that flow or appear in a text as a stand-alone past reference. These include non-narrative expressions that signal a mental leap to the past (e.g., used to, have been, had always wanted, would have liked to), or a future-in-the-past that adds critical perspective to a narrative (e.g., Lincoln was to look for the general who could win the war for him; Mom was to be home by now).

2.11 Future Orientation

This cluster is active when the text makes a casual reference to the future (e.g., in order to, look forward to, will be in New York) or actively predicts the future (e.g., By 2020, the social security fund will be bankrupt). As we will see below, one of the genres that makes the most active use of this cluster is instructions. In this genre, the future is regularly used to guide audiences by telling them what they will see if they take correct or incorrect actions (e.g., turn right at the corner and you will see a gas station).

2.12 Description

This cluster is active when the text refers to input from the five senses: the sights, sounds, aromas, touches ("the warm embrace"), and tastes ("the salty crunch of bacon") of experience. Description conjures concrete spaces, scenes, and objects with lively, colorful properties in the reader's mind. It also includes words that reference visible objects (e.g., table, chair), spatial relations (e.g., near to, seated with), motions (e.g., run, jump), dialogue cues (e.g., "….," he barked.), and oral speech features (e.g., uh huh).

2.13 Public Register

This cluster helps constitute important aspects of public or institutional genres, namely words and phrases that reference institutions (e.g., judicial, electoral, senatorial) and groups and roles authorized by them (republicans, democrats, chairmen, official). There are many subclusters, including words invoking institutions, groups, and roles functioning as commonplace authorities (e.g., founding fathers, Emirs, parliamentary course of action, duly authorized), precedents (e.g., widely accepted), received points of view (e.g., some hold that; others maintain that), or established thought that is now being rehearsed or confirmed (e.g., I recognize that, I agree with).

2.14 Values

This cluster, often aggregating with public registers, is active when public values are invoked. The values may be positive (e.g., justice, happiness, fairness, the good), with one strand of positive values referencing innovation (e.g., breakthrough, cutting-edge,
state-of-the-art). They may also reference values that are negative (e.g., injustice, unhappiness, unfairness, bad).

2.15 Information

This cluster, also often aggregating with public registers, includes methods of exposition that indicate traversal of an information space. Unlike narrative structures, which move temporally from event to event, information is structured around hierarchically organized "points" and "sub-points." Subclusters include generalization (e.g., all, every), examples and illustrations (e.g., for example, to illustrate), comparison (e.g., more than, fewer than), resemblance (e.g., resembles, looks like), specification (e.g., in particular, more specifically), exceptions (e.g., an exception, the only one to, sole), and definition (e.g., is defined as, the meaning of the term).

2.16 Reportage

This cluster, often co-occurring with information structures, is active when verbs are used to report states of affairs (e.g., is carried by, is housed in), events (e.g., established, instituted, influenced), processes (e.g., make, change, transform), updates (e.g., the latest, late-breaking, announcing a new), and sequences (e.g., first, second, third). Reportage verbs differ from narrative verbs mainly by the register with which they are associated. Narrative verbs most strongly correlate with private registers, reporting verbs with public registers.

2.17 Directives

Directive clusters are active when the reader of a text or a character within the text is directed to believe or do something. Subclusters include imperatives (e.g., come to; stop what you are doing); procedures (e.g., fold on the dotted line; do not bend, fold, or mutilate); moving the body (e.g., clasp, grab, twist, lift up); error-recovery (e.g., should you get lost...); and insistence (e.g., you need to come; you need to consider). Hallmarks of insistence are also conveyed in the modals "must," "should," "need," and "ought."
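Concretely, each of these clusters can be pictured as nothing more than a named collection of words and multi-word phrases. The Python sketch below is purely illustrative: the cluster names and entries are abridged from the examples above rather than taken from any actual dictionary, and the matching function is deliberately naive.

# Illustrative only: a handful of the clusters from this section, represented
# as a mapping from cluster name to example phrases. Real cluster dictionaries
# contain orders of magnitude more entries.
LEXICAL_CLUSTERS = {
    "third_person_attribution": ["he", "she"],
    "first_person_private_register": ["i think", "i feel", "i believe", "i regret that"],
    "narrative": ["when she was young", "in her youth", "for two years"],
    "interactivity": ["let us", "i urge you to", "trust me", "how may i help"],
    "reasoning": ["thus", "therefore", "because", "suppose that"],
    "future_orientation": ["look forward to", "in order to"],
    "directives": ["fold on the dotted line", "you need to"],
}

def clusters_in(text: str) -> set[str]:
    """Return the names of clusters whose example phrases occur in `text`
    (whole-word matching on a lowercased copy of the text)."""
    padded = " " + " ".join(text.lower().replace(",", " ").replace(".", " ").split()) + " "
    return {name for name, phrases in LEXICAL_CLUSTERS.items()
            if any(" " + phrase + " " in padded for phrase in phrases)}

print(clusters_in("I believe that when she was young she lived in Ohio."))
# e.g. {'first_person_private_register', 'narrative', 'third_person_attribution'}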
3 Clusters Composed into Prototype Genres

We were able to treat our 17 clusters as multivariate objects that aggregate and divide to create seven prototype genres [10]. By prototype genres, we mean genres that form a deep lexicon providing building blocks of many historically specific genres.

3.1 Self-Portraits

Self-portraits underlie English genres of self-expression, ranging from diaries and personal journals to autobiographies, first person blogs, and cover letters. Self-portraits tend to aggregate First Person Reference/Attribution [cluster 1.1], First Person Private Register [cluster 1.3], Generalized Private Register [cluster 1.4], and Narrative [cluster 1.5]. When used for business cover letters, self-portraits tend to
aggregate with Public Registers [cluster 1.13] and clusters variously associated with public registers [clusters 1.15-1.16, namely information and reportage]. The writer of the business cover letter is constructing an image of him or herself, but that image is narrowly constrained to employment interests and trajectories.

3.2 Observer Portraits

Observer portraits underlie English genres such as biography, third person memoir, and personal and professional profiles. These portraits range from the lighthearted and subjective profile of a teen to the staid and consequential profile of a CEO for an annual report or a candidate for elective office. Observer portraits take as their foundation third person reference/attribution [cluster 1.2], description [cluster 1.12], and values [cluster 1.14]. The genre, at base, describes another person and his or her values from the third person perspective. Observer portraits also tend to employ a generalized private register [cluster 1.4] and narrative [cluster 1.5] if the portrait is personal, and a public register and its associated clusters [clusters 1.15-1.16, information and reportage] if the profile is public. Like self-portraits, observer portraits can be literary and range across all aspects of life in and outside of a professional resume. However, observer portraits can also be professionally focused (a CEO profile) and limited in the observations they make about the subject of the portrait.

3.3 Scenic Writing

Scenic writing underlies sensory writing found in poetry, literature, and fiction writing for the leisure market. It is also a component of genres where visual information is necessary and not just nice (e.g., observer portraits, field writing among geographers and anthropologists, science writing for lay audiences, and instructions). Further, scenic writing enhances the exposition in any form of writing where the visual is an optional dimension. Scenic writing relies on description [cluster 1.12] and narrative [cluster 1.5], where the challenge is to tell stories that naturally emerge from the close observation of spaces. Scenic writing often requires the writer's discipline to remain within description, so as to keep the reader an eyewitness to what the writer sees and hears, rather than what the writer thinks.

3.4 Narrative History

Narrative history underlies the genres of history, biography, autobiography, and literature told from a narrative perspective. The key clusters are narrative [cluster 1.5], past orientation [cluster 1.10], and interpersonal relations [cluster 1.9], as most writing with a narrative basis deploys shifting input about interpersonal relationships as a way of keeping the reader involved. The narrative may remain within the private register [cluster 1.4] as a piece of personal history, or within the public register [cluster 1.13] as a piece of institutional history. It may also be launched from the first person [cluster 1.1] or the third person [cluster 1.2]. Narrative histories seek to help a reader recover a world that is no more, with the vividness of yesterday.
3.5 Information

Information writing underlies all genres that seek to supervise learning for a reader. Unlike experiential writing in the leisure market, where the reader's learning is unsupervised, information writing tends to contractually lay out for the reader (through purpose statements and other devices) what to learn. The reader is a client of the writing and not simply a patron. Unsurprisingly, the information cluster [cluster 1.15] is the fundamental cluster driving information genres; but information genres also tend to rely a great deal on interactivity [cluster 1.6] in order to establish a relationship of trust and personal connection with the reader as learner. Information genres tend to rely on public rather than private registers; when the information is event driven, they rely on reportage [cluster 1.16] (and on sequence if the events reported occur in sequence), and they turn to narrative [cluster 1.5] only if personal stories are seen as a good way to illustrate the main points being conveyed. What distinguishes information writing is that the reader is promised some durable learning (e.g., "points to retain") meant to survive the reading experience itself.

3.6 Instruction

Instructional genres are evident in product instructions, procedures, regulatory forms, and any other document seeking to build manual skill or to achieve compliance. The writing underlying these genres depends upon directives [cluster 1.17] as a way to guide the reader through a manual task or a task space of requirements and constraints. Instructional writing also tends to be dependent upon spatial description [cluster 1.12] and even a future orientation [cluster 1.11], insofar as readers are often told how the future will appear (e.g., you will see) as a way of offering visual confirmation of correct action to the instruction taker. Instructions must accommodate the reader's interest in solving spatial problems or achieving compliance with regulations.

3.7 Argument

Argument writing underlies persuasive writing across the disciplines, the professions, and civic life, from legal briefs and appellate decisions to academic journal articles, petitions, memoranda of professional lobbyists, candidate press releases, and letters to the editor. A hallmark of such writing is a focus on reasoning [cluster 1.7], along with a focus on a public register [cluster 1.13] and values [cluster 1.14]. In truth, argument is a situational and dynamic genre, meaning that the writer anticipates what clusters are most needed to break down resistance to the message being offered and then deploys those clusters. In this sense, no cluster is out of bounds when it comes to argument.
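To show how this composition works mechanically, the sketch below (illustrative Python; the genre-to-cluster mappings are transcribed from the prose of this section, and the counting rule is a deliberate simplification of the multivariate analysis described in Section 5) ranks the prototype genres by how many of their constituent clusters a text activates.

# Illustrative mapping of the seven prototype genres to the clusters named in
# Section 3, with a crude scoring rule: count how many of a genre's clusters
# a given text activates. The real analysis is statistical, not a raw count.
PROTOTYPE_GENRES = {
    "self_portrait": {"first_person", "first_person_private_register",
                      "generalized_private_register", "narrative"},
    "observer_portrait": {"third_person_attribution", "description", "values"},
    "scenic_writing": {"description", "narrative"},
    "narrative_history": {"narrative", "past_orientation", "interpersonal_relations"},
    "information": {"information", "interactivity", "reportage"},
    "instruction": {"directives", "description", "future_orientation"},
    "argument": {"reasoning", "public_register", "values"},
}

def rank_genres(active_clusters: set[str]) -> list[tuple[str, int]]:
    """Rank prototype genres by the number of their clusters a text activates."""
    scores = {genre: len(active_clusters & clusters)
              for genre, clusters in PROTOTYPE_GENRES.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# A text activating narrative, description, and past-orientation clusters
# scores highest on scenic writing and narrative history.
print(rank_genres({"narrative", "description", "past_orientation"}))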
4 Genre and Instinct

The lexical clusters elaborated above are never explicitly taught in school, either to native or to non-native speakers. They seem to be part of the natural equipment of using language in the oral register prior to schooling. They seem, as we have asserted,
more the product of instinct than rule. Moreover, because these clusters combine in various combinations to support genre, genre itself seems to have an instinctual inheritance in language apart from convention or culture. Genre [5] is a recognized category of writing that shares a common form, purpose, or content. Student writing classes and assessment protocols are often organized into genre tasks. A common assessment genre in the United States is the five paragraph essay [11] and in China the two or three paragraph essay [12]. Yet, however formularized a genre may become for evaluation purposes, it defies exhaustive specification through rules, and students who construe the task as nothing more than applying and memorizing rules typically do poorly. Formulas can help writers adjust to the overwhelming demands of an assignment. They can be a good place for writers to start, but writers would not be writers if formulas were a good place to end. Knowing we are involved in one genre or another as a reader or writer is an instinctual judgment that eludes formulas. We have no trouble recognizing the discrete sentences of a detective novel or cookbook, but we have a much harder time describing the different experiences in which these sentences are embedded.

This instinctual aspect of genre makes it especially challenging for evaluators of student writing to make assessments like "the student's writing is appropriate to the writing task." Insofar as the writing task is associated with an evaluation of genre appropriateness, this criterion calls upon two instinctual judgments: the genre under evaluation and the achievement of appropriateness within this genre through the discrete actions of the student writer. The writing assessor may think the student has hit or missed the bull's-eye by either a little or a lot; but, in either case, both the target to be hit (the genre) and the extent to which it is hit (appropriateness) at least begin in gut-level judgments that may never move beyond that.

Our instincts about a text's appropriateness or genre (or both) are difficult for serial readers to elaborate on because serial readers pay inordinate attention to the text segments they are currently processing [10], whereas analyses of genre must take into account the collective weight of the whole text at once. Instinctual judgment for the serial reader of text is not necessarily immediate judgment. Often, a reader must sample widely across a text to determine if his or her first instincts survive or are overthrown by secondary and later instincts. This fact significantly increases the labor of assessing writing, particularly judgments of appropriateness (of task or genre). If a paper falls outside the ballpark of genre appropriateness, line by line commentary can be fruitless. The student will need to rethink the task representation rather than fix broken patterns. Yet teachers can waste vast amounts of time reading and scanning to form a conclusive instinctual judgment about appropriateness.
5 Computational Recognition of Lexical Clusters Fitting Genre

Computers are able to weigh multiple passages at once and so can, in principle, settle on instinctual judgment more cost-effectively than humans. To automate instinctual judgments of genre identification for various purposes, including writing assessment, Kaufer and Ishizaki created a computational environment, DocuScope [8, 10, 13], that makes judgments about the genre characteristics of student writing. The program provides an environment for tagging student writing samples through pre-defined dictionaries, comprising 200 million patterns composed of individual
words and word sequences created through incremental coding and testing of hundreds of texts across dozens of genres. By hand coding words and short sequences, typically between 2 and 6 words in length, into functional categories, DocuScope classifies a text through human-informed local judgments, and these judgments in turn can be aggregated to generate an impression of the whole. As a whole, the patterns are designed to span the variation one can find by reading across a spectrum of American English prose, from fiction to non-fiction, from everyday informal texts to erudite academic treatises. The patterns found are then statistically analyzed by multivariate methods in order to determine which families of patterns are used centrally (prototypically) in a text or corpus of texts and which are used only at the margins. DocuScope has been used in writing education in the United States, but mainly as a tool for textual researchers.

At the heart of the DocuScope program is a text visualization and analysis environment specifically designed to carry out rhetorical research with language and text. The program permits human knowledge workers, through computer-aided visual inspection and coding, to harvest and classify strings of English, primarily 1-5 contiguous word sequences. These are strings that, without conscious effort, speakers and writers use and reuse as part of their vast repertoire of implicit knowledge relating to language and the audience experience of the clusters constituting genre. We have chosen a knowledge-based, expert-system-like approach for our language measures because we were especially interested in the analysis and discovery of textual genres. Genres lie at the intersection of language and culture as tools to perform situated work [3]. To capture them requires a breakdown of texts that considers how texts structure experience for human readers.

Employing an implementation of the Knuth-Morris-Pratt string matching algorithm [14], the DocuScope environment has the ability to uniquely discriminate patterns of any arbitrary length in a textual stream. For example, the string matcher knows that the 9-word string "the cat jumped over the table to get food" is non-identical to, and can be classified independently of, the 10-word string "the cat jumped over the table to get food yesterday." This flexibility provides the capacity to separate small but important functional differences in the textual stream with contextual nuance. For example, we could discriminate the shorter string "smeared her," which could be a negative category, from the longer string "smeared her with soap in the tub," which indicates physical motion more than negativity. Aggregating functional categories like negative affect, physical motion, and dozens of other functional categories into frequencies and comparing the frequencies across different text types has shown DocuScope to be a useful tool for distinguishing major genres of English.

To find and classify strings of reader experience in English texts, we employed iterative hand coding methods. We first coded a list of categorized strings and then observed how they matched on a set of sample texts. We used our string-matcher on new texts to test our prior codings for accuracy and completeness. When we discovered our string-matcher making incorrect (that is, ambiguous, misclassified, or incomplete) matches on the new texts, we would use this information to elaborate the strings our string matcher could recognize.
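A minimal sketch of the longest-match behavior just described may be helpful. DocuScope couples a Knuth-Morris-Pratt matcher with very large hand-built dictionaries; the toy tagger below instead uses a simple greedy scan over a few invented entries, merely to show how a longer string ("smeared her with soap in the tub") can be recognized and classified independently of a shorter string it contains ("smeared her").

# Toy dictionary: token sequence -> functional category. The entries are
# invented for illustration, not taken from the real DocuScope dictionaries.
PATTERNS = {
    ("smeared", "her"): "negative_affect",
    ("smeared", "her", "with", "soap", "in", "the", "tub"): "physical_motion",
    ("i", "believe"): "first_person_private_register",
}
MAX_LEN = max(len(p) for p in PATTERNS)

def tag(text: str) -> list[tuple[str, str]]:
    """Greedy longest-match tagging: at each position, prefer the longest
    dictionary phrase that matches, and emit (phrase, category) pairs."""
    tokens = text.lower().split()
    tags, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in PATTERNS:
                tags.append((" ".join(candidate), PATTERNS[candidate]))
                i += n
                break
        else:
            i += 1  # no pattern starts here; move to the next token
    return tags

print(tag("she smeared her with soap in the tub"))
# -> [('smeared her with soap in the tub', 'physical_motion')]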
By repeating a cycle of coding strings on training texts and then testing and elaborating strings based on how well they explained priming actions on new texts, we were able to grow our catalog of strings systematically and consistently. Let’s now watch this process in action. Imagine
reading a newspaper containing "smeared the politician" as a verb phrase. This first inspection would prompt the generalization that the string "smeared the politician" conveys the idea of negative affect. We would code it thus in our dictionaries. We would input into our dictionary many syntactic variations that also conveyed this negative idea (e.g., smeared him, smeared her, smeared them). Now these dictionaries would be applied to new collections of uncoded texts, allowing us to find the limits of our own initial coding assumptions. For example, the coding of "smeared him" with negative affect would seem incorrect in the longer string environment of "smeared him with soap." Although errors of this type were ubiquitous, particularly in our early generations of coding, the software made it relatively easy for us to spot the mistakes and revise the underlying dictionaries accordingly. We thought of this rapid revision process as "improving the eyesight" of the dictionaries by putting human readers in the loop to assist it.

Over three years, we repeated this process of adding, testing, and differentiating strings of English over thousands of texts. We stayed with this process until we had nearly 150 categories that seemed robust and stable and that could differentiate, in principle, millions of strings. As our coding of strings evolved, we were able to derive formal decision criteria for classifying strings into one of 17 overall clusters. The string matcher can match any literal string of English of any length. For efficiency of coding, the software allowed us to run the string matcher on up to 500 texts at a time and over any number of user-defined categories of different strings. When the string matcher found a matching string in any of the target texts, it tagged it by name and color. The visualizer made it relatively easy for the research team to study the performance of the string-matcher and to improve it rapidly based on errors in its performance. The visualizer also made it possible to build a very large and consistently classified inventory of priming strings in a relatively short amount of time.

Where did we find the speech and texts to look for priming strings? We sampled the Lincoln/Douglas debates [15] and texts associated with description, narrative, exposition, reporting, quotation, dialogue, and conversational interaction. We also relied on three "seed" text collections. The first was a 120-text digital archive of short stories and fictional work. The second was a database of 45 electronic documents associated with a software engineering project, including proposals to the client, software design specifications, meeting minutes within the design team, meeting minutes between the design team and the client team, software documentation, focus group reports, public relations announcements, and feature interviews. We constructed a third archive from the Internet: the Federalist papers, the Presidential Inaugurals, the journals of Lewis and Clark, song lyrics from rappers and rockers, the clips of various syndicated newspaper columnists, the Web page welcomes of 30 university presidents, Aesop's fables and the Brothers Grimm, the writings of Malcolm X, the 100 great speeches of the 20th century, 10 years of newspaper reporting on the Exxon Valdez disaster, and movie reviews. We sampled 200 texts from this collection and made sure that we had multiple instances of each type of writing so that each type could be divided into training and test runs as we cycled through test and improvement cycles.
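The revision cycle just described can be pictured as a small loop: run the current dictionary over new texts, let a human reviewer flag a wrong match, and patch the dictionary, typically by adding a longer, more specific pattern that a longest-match tagger will then prefer. The sketch below is illustrative only; the entries are invented, and the real process unfolded over thousands of texts and three years of hand coding.

def add_specific_pattern(patterns: dict, phrase: str, category: str) -> dict:
    """Extend the dictionary with a longer, more specific pattern that a
    longest-match tagger will prefer over any shorter pattern it contains."""
    revised = dict(patterns)
    revised[tuple(phrase.lower().split())] = category
    return revised

# Generation 1 codes "smeared him" as negative affect; a reviewer then flags
# the match inside "smeared him with soap" and the dictionary is elaborated.
PATTERNS_V1 = {("smeared", "him"): "negative_affect"}
PATTERNS_V2 = add_specific_pattern(PATTERNS_V1, "smeared him with soap", "physical_motion")
print(PATTERNS_V2)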
On a weekly basis over a three-year period, we also coded strings from The New Yorker and from the editorials, features, and news pages of The New York Times. To capture data from speech, we coded for 2 to 4 hours every week the
strings we heard over radio stations that focused on news, talk, or sports. The visualization environment allowed us to visually inspect and test new samples in our archive. To further control quality, we built a collision detector that would warn us if we assigned the same string to multiple categories. This helped us locate and debug inconsistencies and ambiguities in the string data.

The visualization environment we taught with [10] also became a centerpiece in one of our graduate writing courses. As part of their continuing training in close reading, students were asked to keep semester logs of the matched strings they found in their own writing and in the writing of their peers. They were asked to keep systematic notes about whether the strings matched by the software were ambiguous or incorrect. In cases where they found errors, students proposed changes to the software's internal dictionaries as part of their log-work. If their proposals were verified by the course instructors, the internal dictionaries were changed to reflect them. It is beyond the scope of this paper to say more about these methods, but further discussion of these techniques is available elsewhere [8, 13].
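The collision detector mentioned above is simple to picture: scan every category's string list and report any string assigned to more than one category. A minimal sketch, assuming the dictionaries are held as a mapping from category names to lists of strings (the entries shown are invented):

from collections import defaultdict

def find_collisions(dictionaries: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return every string assigned to more than one category, together with
    the offending categories, so the coder can resolve the ambiguity."""
    owners = defaultdict(list)
    for category, strings in dictionaries.items():
        for s in strings:
            owners[s.lower()].append(category)
    return {s: cats for s, cats in owners.items() if len(cats) > 1}

print(find_collisions({
    "interactivity": ["let us", "trust me"],
    "interpersonal_relations": ["i promise", "trust me"],
}))
# -> {'trust me': ['interactivity', 'interpersonal_relations']}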
Fig. 1. Separating genres in a 2D projection of N dimensions by selecting and aggregating lexical clusters on an X and Y axis. This is a snapshot of one of the DocuScope interfaces used to separate genre through the selection and aggregation of specific lexical clusters. In this figure, the user has selected clusters (past, description, narrative) on the Y axis associated with reminiscences. The user has selected clusters (first person, first person personal register, personal register, and interactivity) on the X axis associated with letters. The interface confirms that these features are relevant to defining similarities and differences between these genres by separating reminiscences high on the Y axis and letters to the far right on the X axis.
The relevant point for now is why we needed software in the first place to capture the communication instincts evident in a text. The web of considerations behind the "instinct" that a paper is or is not appropriate to a genre is so vast that a computer, monitoring hundreds of lexical clustering patterns at once, can keep track of it more easily than a human reader can. Figure 1 displays a visual snapshot of one of DocuScope's interfaces, which is used to separate genres visually according to the lexical clusters they make active. In Figure 1, 30 Civil War letters and 30 reminiscences were drawn from the Internet. Letters are expected to be based more in first person, personal, and interactive lexical clusters than reminiscences. Reminiscences, by contrast, are expected to alert the reader to recalled situations from the past that are more deeply spatially and temporally elaborated than the situations of letters. Figure 1 shows how the DocuScope interface can confirm these hypotheses. The user clicks on different lexical clusters in a 2-D (X, Y) projection of the N-dimensional space to isolate and combine lexical clusters into genre groupings. In Figure 1, the user has selected clusters on the Y axis (viz., past, descriptive, narrative) favoring reminiscences and clusters on the X axis (viz., first person, first person personal register, personal register, and interactivity) favoring letters. As one can see, the selection of these clusters causes a visual separation of the actual letters and reminiscences. The reminiscences, coded in orange, dominate the top of the Y axis, while the letters, coded in red, dominate the right of the X axis.
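The projection behind Figure 1 can be approximated in a few lines: given per-text frequencies for each cluster, sum the clusters the user assigns to the X axis and to the Y axis and scatter-plot the texts. The sketch below assumes a small, invented frequency table; DocuScope performs the equivalent operation interactively over its full cluster set.

import matplotlib.pyplot as plt

# Invented per-text cluster frequencies (per 1,000 words); in practice these
# would come from tagging each text with the cluster dictionaries.
texts = {
    "letter_01":       {"first_person": 42, "interactivity": 18, "past": 6,  "description": 4,  "narrative": 5},
    "letter_02":       {"first_person": 38, "interactivity": 22, "past": 8,  "description": 6,  "narrative": 7},
    "reminiscence_01": {"first_person": 12, "interactivity": 3,  "past": 30, "description": 25, "narrative": 28},
    "reminiscence_02": {"first_person": 15, "interactivity": 4,  "past": 27, "description": 22, "narrative": 31},
}

x_clusters = ["first_person", "interactivity"]     # clusters selected for the X axis
y_clusters = ["past", "description", "narrative"]  # clusters selected for the Y axis

xs = [sum(freqs[c] for c in x_clusters) for freqs in texts.values()]
ys = [sum(freqs[c] for c in y_clusters) for freqs in texts.values()]

plt.scatter(xs, ys)
for name, x, y in zip(texts, xs, ys):
    plt.annotate(name, (x, y))
plt.xlabel("first person + interactivity")
plt.ylabel("past + description + narrative")
plt.show()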
6 Instinct and Rule in the Assessment of Writing

Instinct is an important criterion for assessing student writing, but not the only criterion. Assessment criteria involve a mixed assortment of judgments across various grain sizes of text. The verbal criteria typically begin, we have argued, with coarse-grained judgments that one might call "instinctual." These judgments depend upon a gut reaction about the document's overall "appropriateness" to the writing task or genre. Instinctual judgments can't be localized to certain passages but implicate the reader's overall impressions of the text. Instinctual judgments rely on combinations of judgments across many different passages. They further rely on the rater's history of experience with previous tests and his or her responses to these as appropriate or inappropriate to the task or genre. There is no descriptive vocabulary ready at hand to characterize these whole-text instinctual judgments. They tend rather to be captured in a prescriptive shorthand to register that the text passes threshold ("is appropriate to the genre") or not ("is not appropriate to the genre").

In marked contrast to instinctual judgments in writing assessment, there are also what one might call "discrete" judgments of defective patterns. These are the fine-grained judgments that detect errors in the surface text. These errors are "visual" in the sense that they are associated with enumerable defective patterns. These defects can be taught and remedied through perceptual training because they are spatially circumscribed in the textual stream. These defective patterns can furthermore be individuated, tallied, and graphically displayed. Raters' judgments of discrete criteria are rule-governed rather than instinctual. A human rater can detect, and often state a rule for verifying, the presence of a grammatical error at a specific location of text [5]. Further, the existence of the error is independent of what's happening with other
passages across the real estate of the text. This property of independent distribution guarantees that grammatical mistakes and other errors spotted through perception can be arithmetically summed to get a scalable assessment of overall errors in a text. By contrast, instinctual judgments of genre appropriateness are not additive in this way. What makes a text inappropriate to certain genre requirements typically has less to do with raw counts of "defective patterns" in the surface stream and more to do with detecting deviations in the underlying task representations behind the patterns chosen. To judge writing as inappropriate to a genre is to judge that the writer has failed to meet important, culturally situated requirements that stand behind the words.

Halfway between the realm of instinct and the realm of rule in writing assessment are the mid-tier questions of organization and style. Rating these aspects of student writing can be executed from the top down, as the residue of instinctual judgments about appropriateness and coherence, or from the bottom up, as the residue of rule-governed judgments about grammaticality, correctness, and cohesiveness. Although this meso-level of writing assessment is very important, we won't discuss it further here.

Noteworthy for present purposes is that the notion of instinct has been problematic for the institutions of assessment to address. Instinct, after all, seems a flimsy notion on which to base assessment. Whatever the validity of instinct, it seems diametrically at odds with the staid institution of assessment. Assessment relies on elaborated and consensual reasons and accepted rules of validation. Instinct relies on neither. Formal assessment may begin with the gut, but it is not supposed to stop there. Yet, rather than halt assessment in its tracks, writing teachers and evaluators for generations ignored the qualitative differences between judging that a writer had produced an "inappropriate" text and judging that the writer had split an infinitive. Both phenomena, on this long-held commonplace view, could simply be tabulated as "errors."

Dramatic change in this thinking occurred in the 1970s because of the pioneering work of Shaughnessy [16] and her many followers who taught basic writers of English [17, 18]. Through her work, Shaughnessy demonstrated that student errors were more epiphenomenal than phenomenal, that the defective patterns in the surface stream of text were not self-evident aberrations but rather socio-cultural curiosities that were rationally tied to the student's background of experience and acculturation. For the teacher to help correct the defective patterns, she would need to recover the student's hidden rationality, based on the student's limited cultural experience, and then seek to augment that rationality by extending the student's practice. The virtue of Shaughnessy's approach, now dominant among composition researchers in the U.S., was to delve into the black box of appropriateness judgments through the window of defective patterns at the surface. While this approach has strengths, it is not without limitations. One drawback is that, while it can lead to insightful and rigorous analysis, this approach requires considerable time and training, and it is impossible to deploy within the real-time constraints of formal assessment.
A second, subtler and unintended consequence is that, by making the text epiphenomenal, this approach can potentially sever all relationships between the judgment of textual appropriateness and textual patterns. Some in composition have openly championed this very implication, arguing that “good writing” should be defined by “good teaching” rather than “good texts” [18]. While this view has more merit than it may seem at first, it does, taken to the extreme, deny
the text as an evidentiary source of quality writing, which runs counter to the very feasibility of written assessment. More important for our purposes, this view discourages looking to see whether instinctual judgments of textual "appropriateness" can be rooted in deep patterns of text.
7 Is Genre Appropriateness Cross-Cultural?

To gain fluency in a language is to acquire the instincts through which communicative intentions are realized and genres are enacted. As we have argued, these instincts are buried deep within the lexical clusters of a language and not simply learned through the externalized rules of situations in a culture. We have suggested that learning such rules can improve the performance of genres at the margins. But the essence of genre performance lies less in these externalized rules per se than in the lexical instincts at the core of meaning-making in a language. However, these propositions remain unproved conjectures. To test them, we sought a student population of non-native speakers who had little face-to-face contact with native speakers and little face-to-face exposure to cultural situations in the West. Students at Guizhou University, in the remote Guizhou province of China, who were taking English courses but were not English majors seemed a good pilot population for such a study.

In university-level English curricula in China, there is a bifurcation between the curriculum for English majors and that for non-majors. The major in English is relatively small among the population of Chinese college students, and so classes for English majors tend to be small (under 30 and often under 20), comparable to many US English language classrooms for majors or non-majors. Curriculum coverage is also similar, with Chinese English majors getting work in literature and advanced writing, including work in English modes and genres. In standard national tests of student writing in both China [12] and the United States [19, 20], human raters are asked to judge student essays on a multiple-point scale, often 1-6, where a "6" paper is rated highest and a "1" lowest [11]. Verbal criteria are given to help raters discriminate high, medium, and low ranking essays along this number system. English majors must pass the Test for English Majors (TEM), which also has a writing component and comes in two varieties. TEM 4 is taken at the end of the sophomore year and must be passed for the student to receive the English major. TEM 8 is taken in the senior year and can help determine whether the student is recommended for graduate work in English abroad [12].

Non-majors in China learn English under very different circumstances. China has the largest population in the world and the most aggressive English language learning programs. At the university level, all enrolled students who are not majoring in English must take two levels of the national College English Test (CET4 and CET6). Students who fail CET4 are not allowed to graduate. Students who fail CET6 may not be recommended for graduate training or work in other countries [12]. Non-majors congregate in classrooms that can range between 30 and 80 students, making assignments in production (speaking and writing) far less frequent than assignments in listening or reading. This is unfortunate because, whether they get writing practice or not, their performance will be measured on the high-stakes CET tests. Even more
importantly, non-English majors are enrolled in a variety of professional fields where proficiency in English across all modalities will be assumed, including proficiency in writing and in professional genres.
8 Prototypical vs. Peripheral Student Drafts

We introduce the notion of the prototypical and the peripheral text. A prototypical text is defined in terms of genre characteristics. A genre can be defined as a recurring combination of lexical clusters that addresses the recurrent situational needs of the reader [4, 5, 7]. Different genres call for different lexical clusters depending on the situation of the writing and the relationship that the writer seeks to establish with the reader. A prototypical text is always relative to a genre of interest. An information text that exhibits the lexical clustering appropriate to sharing information with a reader is a prototypical information text. A narrative text that exhibits the lexical clustering appropriate to telling a story is a prototypical narrative text. By way of contrast, a text that fails to exhibit the lexical clustering of a target genre can be called peripheral to that genre. The text simply falls "out of range" of what it is specified to do in the rhetorical situation.

We have already presented a technological method for analyzing student texts for their lexical clustering. This technology allows a teacher in a large class of English non-majors to assign a text in a specified genre and to assess the texts as prototypical or peripheral to the assigned genre in a fraction of the time it would take to read and mark them. Students who write texts that are peripheral to the genre likely lack the English lexical clustering patterns to differentiate the genre they are writing in from other genres. Apart from theory testing, a pedagogical extension of our method is that it extracts and isolates the successful patterns of the students able to write prototypical texts. These patterns can then be taught directly to the less successful students. After further training, these students can be invited to revise the assignment, and the technology can reassess whether their text is now moving closer to prototypical status. A second important feature of our approach is that the teacher can assess and provide feedback to students on these important characteristics of genre writing without the labor of line-by-line reading and marking. The teacher instead runs a computer program and statistical methods that take seconds or minutes rather than hours or days of tedious labor.
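One crude way to operationalize "in range" versus "out of range," offered here only as a sketch (the analysis reported in Section 10 relies on factor analysis and MANOVA rather than on this rule), is to compare a text's cluster-frequency profile with the average profile of texts known to fit the genre and to call the text peripheral when it falls beyond some empirically chosen distance. All names and numbers below are invented.

import math

def profile_distance(text_freqs: dict, genre_centroid: dict) -> float:
    """Euclidean distance between a text's cluster frequencies and the mean
    (centroid) frequencies of texts that fit the target genre."""
    clusters = set(text_freqs) | set(genre_centroid)
    return math.sqrt(sum((text_freqs.get(c, 0.0) - genre_centroid.get(c, 0.0)) ** 2
                         for c in clusters))

def is_peripheral(text_freqs: dict, genre_centroid: dict, threshold: float) -> bool:
    """Crude 'out of range' test; the threshold would be set empirically."""
    return profile_distance(text_freqs, genre_centroid) > threshold

info_centroid = {"information": 20.0, "interactivity": 12.0, "narrative": 3.0}
student_text  = {"information": 4.0,  "interactivity": 2.0,  "narrative": 18.0}
print(is_peripheral(student_text, info_centroid, threshold=10.0))  # True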
9 Experiments

Our research question was: can Chinese students with no training in English genres create prototypical genres of English? To answer this question, we applied the DocuScope technology to three sections of a required course for non-majors taught by the first author. Each section contained students from a different major, and each section was asked to write a 500-word English text in a different genre. The first section, consisting of 38 telecommunication majors, was asked to write a description of their dorm room. The second section, consisting of 27 physics majors, was asked to write a narrative recounting what they did during the week of China's 60th national birthday
celebration. The third section, consisting of 36 economics majors, was asked to write an information paper that needed to "teach their reader" something. Except for these scant instructions, no further handouts were distributed or instructions given about what it meant to describe, narrate, or inform in English prose.

Our interest in this study was to employ these methods as a diagnostic tool. We were primarily interested in seeing whether Chinese students with limited exposure to native speaker cultural situations could produce genre-appropriate (or prototypical) texts. We were also interested, for pedagogical purposes, in whether the patterns produced by the capable students could form the basis of instruction for students who did not produce prototypical texts. We must caution at this point that a prototypical text is not necessarily a perfect or even a grammatically correct text. The label simply means that the text contains evidence of lexical clustering in the surface text that is expected and appropriate to the situational demands of the writing. A text that is prototypical must still undergo considerable revision to become an acceptable finished draft. At the same time, that revision toward a finished product will be wasted work if the draft to be revised does not exhibit prototypical features of acceptability.

To identify the English patterns used in the 101 student texts across the three sections, we parsed them with the DocuScope technology. These patterns were then statistically analyzed to determine how the patterns clustered, whether the clusters resembled the predicted lexical clusters of the genres assigned, which students were responsible for the prototypical texts, and which students, writing peripheral texts, created lexical clusters "out of range" of the assigned genres. Prototypical and peripheral texts lie on a continuum. However, for convenience, we focused our analysis on locating the most prototypical text in each classroom and the most peripheral.
10 Results

Our main results suggested that rural Chinese students, even with limited exposure to native speaker situations and no training in English genres per se, were able, on average, to make key separations between information, narrative, and descriptive writing.

10.1 Separation between Information and Description Genres

Figure 2 reveals one of the main results of the analysis, the separation between information and descriptive writing. Figure 2 shows this separation by plotting the first two factors when we factor-analyzed the patterns for co-occurrence regularities. The horizontal axis of Figure 2 shows the pattern clustering more typical of information writing. Information writing consists of patterns of direct interaction with the reader (e.g., "you will now learn"), stimulating the reader's curiosity (e.g., "the puzzle to solve is"), and moving along in a sequence (e.g., "firstly," "secondly," "thirdly"). The vertical axis of Figure 2 shows the dominant clusters of descriptive writing. Among the students in our sample, descriptive writing is restricted to the present tense and is marked by the use of visual nouns (e.g., book, horse) and adjectives (blue, cuddly) and especially visual phrases (e.g., under the cabinet). The circles represent descriptive writing, the squares narrative writing, and the diamonds information writing. The reader will note that the diamonds (information writing) dominate the upper region of
the chart. The circles (descriptive writing) dominate the lower center region. The separation appears in both factors, and both are statistically significant (see the figure caption). We boxed off both clusters as a visual guide to the reader. We were able to isolate one student in the economics section by the name of Li as having written the most prototypical information assignment. Note that Li's paper is farther to the right than any other information paper. We were able to isolate another student in the telecommunication section by the name of Qian as having written the most prototypical description assignment. Note that Qian's paper is lower than any other descriptive paper. These two students were able to produce the papers most "in range" of the restrictions of the genre assigned. Conversely, for each prototype paper, we could identify a corresponding peripheral paper, a paper whose lexical clustering was most out of range for the assignment. Youfei, an economics student assigned an information paper, wrote a text that clustered as a narrative piece. Peng, a telecommunications student assigned a descriptive paper, wrote a text with few descriptive features and also with relatively few markers of the other genres.
[Figure 2 appears here: "Separation of Information and Descriptive Writing," a scatter plot of the student texts on Factor 1 ("Guiding the Reader Sequentially Through Curiosity," x axis) and Factor 2 ("Present-Based Visual Language," y axis), with a legend distinguishing description, narrative, and information texts, boxed prototype regions for information and description writing, and the papers of Li, Qian, Youfei, and Peng labeled.]
Fig. 2. Plotting Factor 1 vs. Factor 2. Factor 1 isolates student texts that include guiding the reader, sequence, prescriptive patterns, and curiosity. Factor 1 significantly distinguishes (MANOVA, F = 24.11; p <= 0) the student-written information texts from the other texts, as the boxed region to the far right indicates. Factor 2 isolates, at its lower end, texts that provide present-based visual language; it significantly distinguishes (MANOVA, F = 13.04; p <= 0) description texts from the other student genres. We have included the names of the students who wrote the seemingly most prototypical and peripheral texts for these factors. Li wrote the most prototypical information paper and Qian the most prototypical description paper. Youfei wrote the most peripheral information paper and Peng the most peripheral description paper.
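Readers who wish to reproduce this kind of plot on their own data can follow the general shape of the analysis: factor-analyze a texts-by-clusters frequency matrix and scatter the first two factor scores by assigned genre. The sketch below substitutes randomly generated placeholder frequencies for real DocuScope output and omits the MANOVA tests reported in the caption; it is an outline of the procedure, not the exact pipeline behind Figure 2.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import FactorAnalysis

# Placeholder data: rows are the 101 student texts, columns are cluster
# frequencies. Real frequencies would come from DocuScope tagging.
rng = np.random.default_rng(0)
X = rng.poisson(lam=5, size=(101, 17)).astype(float)
genres = ["description"] * 38 + ["narrative"] * 27 + ["information"] * 36

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X)  # first two factor scores for each text

for genre, marker in [("description", "o"), ("narrative", "s"), ("information", "D")]:
    idx = [i for i, g in enumerate(genres) if g == genre]
    plt.scatter(scores[idx, 0], scores[idx, 1], marker=marker, label=genre)
plt.xlabel("Factor 1")
plt.ylabel("Factor 2")
plt.legend()
plt.show()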
10.2 Prototypes and Peripheries in Information and Description

To make this discussion more concrete, let us briefly turn to the prototypical and peripheral information and description texts so that we can better understand what these classifications mean.
• Li, the writer of the prototypical information paper, instructs his reader on the art of dancing. As we mentioned above [section 2.5], information writing draws, among other clusters, on the interactivity cluster [cluster 1.6]. And, true to form, Li's prose contains several interactive questions used to lead the reader: "Do you often feel very boring in party or free time?" "Do you want to look like superstar in show time to enjoy yourself?" He uses curiosity-raising language [a subcluster of the interactivity cluster, see 1.6] like "solve" in sentences such as "Are you fascinated by some dance music and don't know how to let it off? The way I use to solve it is dance, which make me feel I am unique, beautiful and graceful." As he enters the exposition of how to dance, he labels each step sequentially, a kind of reportage [cluster 1.16] often coincident with information [cluster 1.15]: "First, you should watch some TV show…Second, you should start to train your dance skill…Then after you have got some dance skills…Finally, you should also know."
• Rather than lead a reader's native curiosity through sequenced chunks of information, Youfei tells an autobiographical story [First Person Private Register, cluster 1.3] about his love of music. Note from Figure 3 that Youfei's text clusters with the narrative texts more than with the information texts. To see how strongly Youfei's writing functions as narrative, consider the abundant use of temporal adverbs (when) and past tense verbs in his language, features so strongly associated with narrative writing. We italicize these narrative clusters in the following passages: "When I was eight, I heard my first song – Dong Fanhong"; "When I was young, my hometown was very poor"; "And at that time, my grandfather bought a tape recording, he was the first man in my village. So I heard my first song from his recording. When I heard the song, I was attracted by it. It possessed me, and it crazed me." When Youfei does turn away from narrative features in his writing, he does not focus on music as public information but on music as an object of his personal likes and dislikes: "I love Zhang Shaohan, because she has a good voice, above all."
• Qian produced a prototypical description by writing about her house. She barrages the reader with a trove of visual phrases: "under a hill," "in the front of the living room," "made up of glass," "on the opposite of the window," "under the TV," "middle of the room," "coffee color," "next to the bed is a wardrobe," and many more.
• Rather than describing with visual language, Peng provides a personal introduction to his family. "Now let me introduce my family to you." He goes into interesting personal reminiscences that strike a narrative more than a descriptive tone: "My father and mother opened a store in order to be better off…Now my family has [seen] great changes."
10.3 Prototypes and Peripheries in Narrative

The physics students in the narrative class wrote narratives across a broad enough range of text types that we needed to sub-classify narrative prototypes into major and minor versions. Major prototypes have two defining features. First, they illustrate clusters of text features that are distinctive of the genre in question and less distinctive of other genres. Second, they illustrate clusters that many writers producing the genre reproduce. The prototypes for information and description writing we discussed above are major prototypes. A minor prototype has only the first feature. It produces features not commonly found in other genres. Still, it is minor because other writers of the genre do not reproduce the feature frequently, at least in the populations being tested. An example should make this clear. The narrative (physics) students were asked to write about what they did over the week of the National Holiday. An expression making its way into the writing of two students was the pseudo-cleft "What I did was…," which can be classified as an "account of action" pattern. Zhaoyi, one of these two students, used this expression several times in his writing. Even though few other narrative students used this expression, the expression was never used by writers of information or description. This caused it to rise to statistical attention when it came to associating this form with narrative writing. The major prototype of narrative writing, however, was marked by the use of temporal expressions (e.g., all week, during the day, as evening fell, in the morning, after lunch) to keep the narrative flowing. Indeed, students in the narrative writing condition used these temporal expressions with significantly greater frequency than students in the other conditions.
Mao’s text on the national holiday is rich with temporal expressions he uses to pace his narrative. For convenience, we italicize them. “On October first, we get together in the dinner room to see the ceremony of reviewing troops…Then soldiers went through the Tiananmen Square. On October third, namely the midautumn festival, I ate moon cake with my classmates....When we came back from Huaxi Park…That time we said "goodbye" to each other…Four days later we begin our class again.
• In the two previous examples of peripheral texts, we observed students who seemed to have adequate English skills but who were not able to associate the right combination of textual features with the specified genres. Hans appears to be a different case. He seems to understand that a narrative text requires past tense verbs and temporal expressions to pace the narrative. Yet at the same time his English language skills are sufficiently uneven as to prevent our technology from recognizing the patterns he uses as standard signals of narrative. Instead of "On this National Day," he writes "In this National Day." Instead of the irregular past tense "sat," he writes "sited." Instead of the temporal adverb "finally," he writes "at final." He misspells the past tense "pretended" as "pretened." Instead of the past perfect "had not found," he writes "not found." These are all small mistakes for a beginning non-native speaker, but each one drops a valuable temporal signal, compromising the overall narrative shape of the writing.
Fig. 3. Plotting Factor 3 vs. Factor 6. Factor 3 isolates student texts that range from the use of temporal expression at the low end of the factor to self-disclosures, acknowledging, updating, and questioning at the high end. This means that the further to the left a text falls, the more temporal expressions it contains and the more it resembles a narrative. Because of its involvement with temporal expression, Factor 3 significantly distinguishes (MANOVA, F = 3.49; p <= 0) the student description texts from the narrative texts, as the boxed regions indicate. Notice how the narrative texts (squares) congregate on the left side of the chart and the description papers (circles) fall to their right. Factor 6 isolates texts where students account for their actions, and this factor also significantly distinguishes (F = 4.54; p <= 0.013) narrative texts from descriptive texts. Notice how the narrative papers appear above most of the descriptive papers (circles). We have included the names of the students who wrote the most prototypical and peripheral texts for the narrative assignment. Mao wrote the most prototypical narrative paper under the definition of a major prototype. Han, on the other hand, wrote the most peripheral paper. As one can see, his paper falls significantly to the right of the major cluster of narrative papers, indicating that he lacks the temporal expressions so vital to standard narrative writing.
11 Theoretical Implications

Dominant theories of genre link genre capacity to knowledge of embedded cultural situations. While we do not deny the importance of such knowledge to genre performance, we maintain that it improves genre performance only at the margins. Far more basic to genre knowledge and performance are instincts about lexical clustering and its relationship to audience experience in a language. While the relationship between lexical clustering and communication instincts varies across languages, we suggest that the instincts themselves are likely cross-cultural. Part of acquiring fluency in a language is to map one's native instincts onto the particular lexical clusters of the target language that are responsible for them. We have sought to make a prima facie case for these theoretical assertions by studying Chinese students in rural China
with limited exposure to native English speakers, English language writing, or English language cultural situations. Despite their limited exposure to English language cultures, the students had sufficient fluency with English to produce writing that, on average, statistically separated some major genres of English. This suggests that the communication instincts relevant to separating one genre from another are part and parcel of fluency training and not reliant on the externalized, context-specific rules of particular cultural situations. While we don't mean to marginalize the place of culture in genre training, we do mean to call attention to, and correct, the marginalization of what we are calling communication instincts in the development of fluency in written genres.

Our experiments suggest there exists a "deep lexicon" in production knowledge that relies not only on single entries, but also on their use in combinations to create gross and subtle audience experiences. To learn to speak and write with sophistication in any language requires the mastery of these combinations. Hoey's work on lexical priming has independently identified the importance of a deep lexicon in language learning [21], and it will be useful to integrate his work more fully into the findings reported here.
12 Pedagogical Implications

We have introduced a technological method for assessing basic genre differentiation skills in the writing of non-English majors in the Chinese university system. The significance of this method for instinctive computing is that judgments about genre appropriateness are not rule-governed; they require gut reactions that can be associated with instinct. While there are many domains where humans would not want to trust the instincts of a computer program over their own, the sheer diversity and complexity of the English language strings in a single text that can help determine its genre give reason to think that a computer's instincts about genre can become as well honed as a human's, and much quicker. This is good news if practical methods to serve English language learners in high-volume countries like China are to be devised.

This method holds promise for several reasons. First, non-English majors will need to read and write in various specialized disciplines, yet non-majors get little first-hand practice in written English genres in their formal classroom instruction. Second, the classes of non-majors tend to be large, straining the teacher's capacity to assign writing, much less offer students individual feedback at the rate they require. Third, the method described in this paper offers the teacher a way of understanding whose English competencies are sufficiently robust to differentiate genres in their own writing and whose are not. Students who are found through this technology to produce prototypical texts, or to approximate the prototype, have just reason to think they can "control the experience" of the English reader with more signal than noise. Such students thus have more motivation to better understand the further noise reduction they can achieve by reducing their grammar, usage, spelling, and other writing mistakes. Students who are found to produce peripheral texts, or to approximate such texts, can learn the patterns they have been missing from the prototype texts in order to achieve the genre discrimination of the prototypes. As with the case of Hans above, the problem may not be the absence of English patterns so much as the high occurrence of so many traditional mistakes that the student's patterns are not recognizable to the
technology. In either case, diagnosing the source of the student’s problem in such cases is made relatively straightforward through the introduction of the technology. Future educational research should look into automating the student submission, tagging, and statistical graphing phases so that one can scale up this technology and do more formal testing across a larger population of classrooms. More formal studies comparing human and computer instincts about genre identification are also needed. Although a critical mass of Chinese students were able to create prototypical drafts without any instruction in genre or pattern training, many students also created peripheral texts. As part of fluency training, further research needs to be conducted about optimal ways for students to acquire lexical patterns relevant to genre expression and differentiation. Acknowledgements. We wish to thank the students of Professor Hu’s classes, who kindly participated in this study. We would like to thank the English faculty of Guizhou University and Beihang University who gave their useful feedback on this research. We would also like to thank the Business faculty of Sydney Technological University who also gave their useful feedback on this research. We especially wish to thank Yang Cai for his encouragement throughout the editorial process and Emily Durbin for her careful editing.
References 1. Pinker, S.: The Language Instinct: How the Mind Creates Language. William Morrow, New York (1994) 2. Sampson, G.: The ”Language Instinct” Debate. Continuum Press, London (2005) 3. Midgley, M.: Beast and Man: The Roots of Human Nature. Harvester Press, London (1979) 4. Berkenkotter, C., Huckin, T.N.: Genre Knowledge in Disciplinary Communication. Lawrence Erlbaum and Associates, Hillsdale (1994) 5. Bonini, C.B.A., Figueiredo, B. (eds.): Genre in a Changing World: Perspectives on Writing. The WAC Clearinghouse and Parlor Press, Fort Collins, Co. (2009) 6. Devitt, A.J.: Writing Genres. Southern Illinois University Press, Carbondale (2004) 7. Bawarshi, A.: Genre and the Invention of the Writer: Reconsidering the Place of Invention in Composition. Utah State University, Logan (2003) 8. Kaufer, D., et al.: The Power of Words: Unveiling the Speaker and Writer’s Hidden Craft. Lawrence Erlbaum, Mahwah (2004) 9. Givón, T.: Topic Continuity in Discourse: An Introduction. In: Givón, T. (ed.) Topic Continuity in Discourse: a Quantitative Cross-Language Study, Benjamins, Amsterdam, pp. 5–41 (1983) 10. Kaufer, D., et al.: Teaching Language Awareness in Rhetorical Choice Using IText and Visualization in Classroom Genre Assignments. Journal for Business and Technical Communication 18(3), 361–402 (2004) 11. Spandel, V.: Creating Writers Through 6-Trait Writing Assessment and Instruction, 5th edn. Allyn and Bacon, Boston (2008) 12. Zang, J.: Certification Programs in China. Translation Journal (October 2006), http://accurapid.com/journal/38certific.htm
13. Kaufer, D.: Ambient Intelligence for Scientific Discovery. In: Cai, Y. (ed.) Ambient Intelligence for Scientific Discovery. LNCS (LNAI), vol. 3345, pp. 129–151. Springer, Heidelberg (2005) 14. Knuth, D.J., Morris, J.H., Pratt, V.: Fast Pattern Matching in Strings. SIAM Journal on Computing 6(2), 323–350 (1977) 15. Kaufer, D., Butler, B.: Rhetoric and the Arts of Design. Lawrence Erlbaum, Mahwah (1996) 16. Shaughnessy, M.: Errors and Expectations: A Guide for the Teacher of Basic Writing. Oxford University Press, New York (1977) 17. Bartholomae, D.: The Study of Error. College Composition and Communication 31(3), 253–269 (1980) 18. Belanoff, P.: The Myths of Assessment. Journal of Basic Writing 10(1), 54–66 (1991) 19. Weigle, S.C.: Assessing Writing. Cambridge University Press, Cambridge (2002) 20. Huot, B., O’Neil, P.: Assessing Writing: A Critical Sourcebook, Bedford St. Martins, Boston (2008) 21. Hoey, M.: Lexical Priming: A New Theory of Words and Language. Routledge, London (2005)
Intuition as Instinctive Dialogue
Daniel Sonntag
German Research Center for Artificial Intelligence, 66123 Saarbrücken, Germany
[email protected]
Abstract. A multimodal dialogue system which answers user questions in natural speech represents one of the main achievements of contemporary interaction-based AI technology. To allow for intuitive, multimodal, task-based dialogue, more than explicit models of the discourse of the interaction, the available information material, the domain of interest, the task, and/or models of a user or user group must be employed. The fact that humans adapt their dialogue behaviour over time according to their dialogue partners’ knowledge, attitude, and competence raises the question of what influence intuition has on this natural human communication behaviour. A concrete environment in which an intuition model extends a sensory-based modelling of instincts should help us to assess the significance of intuition in multimodal dialogue. We explain the relevant concepts, provide references for self-study, and offer a specific starting point for thinking about intuition as a recommendation for implementing complex interaction systems with intuitive capabilities. We hope this chapter proposes avenues for future research towards formalising the concept of intuition in technical, albeit human-centred, AI systems.
1 Introduction
Artificial Intelligence (AI) helps to solve fundamental problems of human computer interaction (HCI) technology. When humans converse with each other, they utilise many input and output modalities in order to interact. These include gestures, or mimicry (including drawings or written language, for example), which belong to non-verbal communication. The verbal communication mode is the spoken language. Some modes of communication are more efficient or effective for certain tasks or contexts. For example, a mobile user interface could be addressed by spoken language in contexts where someone is already busy with his hands and eyes (for example, while driving) or simply to avoid tedious text input modes on mobile devices such as smartphones. AI techniques [1] can be used to model this complex interaction behaviour and Machine Learning (e.g., see [2]) plays a significant role in the modelling of content information over time. We are almost certain that multimodal dialogue-based communication with machines will become one of the most influential AI applications of the future. Who wouldn’t like to speak freely to computers and ask questions about clicked or pointed items, especially when the questions can be answered in real-time
with the help of search engines on the World Wide Web or other information repositories? Eventually, the role of multimodal dialogue systems may shift from mere performance enhancers (voice input is fast and convenient on mobile devices) toward guides, educational tutors, or adaptable interfaces in ambient intelligence environments where electronic agents are sensitive and responsive to the presence of users (also cf. [3]). Multimodal dialogue systems as intelligent user interfaces (IUIs) may be understood as human-machine interfaces that aim towards improving the efficiency, effectiveness, and naturalness of human-machine interaction. [4] argued that human abilities should be amplified, not impeded, by using computers. In order to implement these properties, explicit models of the discourse of the interaction, the available information material, the domain of interest, the task, and/or models of a user or user group have to be employed. But that’s not everything. Dialogue-based interaction technology, and HCIs in general, still have plenty of room for improvements. For example, dialogue systems are very limited in user adaptation or the adaptation to special dialogue situations. Humans, however, adapt their dialogue behaviour over time according to their dialogue partners’ knowledge, attitude, and competence. This is possible because humans’ abilities also include (1) the emotions that are expressed and perceived in natural human-human communication, (2) the instinctive actions and reactions a human dialogue participant performs, and (3) the metacognitive and self-reflective (introspective) abilities of a human dialogue participant to cleverly reason about the actions she or he takes [5]. Intuition is widely understood as non-perceptual input to the decision process: in philosophy, [intuition is] the power of obtaining knowledge that cannot be acquired either by inference or observation, by reason or experience (Encyclopædia Britannica). In this chapter, we ask two things: first, how does intuition influence natural human communication and multimodal dialogue, and, second, how can intuition be modelled and used in a complex (dialogue-based) interaction system? We think that by providing a concrete environment in which an intuition model extends the sensory-based modelling of instincts to be used (multimodal sensors reveal current-state information which triggers direct reaction rules), we can assess the significance of intuition in multimodal dialogue and come one step closer to capturing and integrating the concept of intuition into a working HCI system. In section 2 we will introduce the application environment, i.e., the multimodal dialogue systems for implementing intuition in dialogue. A dialogue example describes dialogue topics, topic shifts, and an intuitive development of a common interest. Section 3 discusses dialogue adaptivity based on a context model which includes user modelling, and dialogue constraints and obligations. Section 4 provides a definition of instinctive dialogues based on multimodal sensory input and the recognition of moods and emotions. Section 5 provides the reader with our main contribution, a model for implementing intuitive dialogue which includes a self-reflective AI model. In section 6 we come to a conclusion.
2 Multimodal Dialogue Environment
Natural Language Processing (NLP) is a wide sphere ([6] discusses a comprehensive set of topics included in this sphere). NLP includes the processing of spoken or written language. Computational Linguistics is a related field; it is a discipline with its roots in linguistics and computer science and is concerned with the computational aspects of the human language faculty. Dialogue systems research is a sub-discipline of computational linguistics and aims at allowing users to speak to computers in natural language. Spoken dialogue technology (as a result of dialogue systems research) additionally includes related engineering disciplines such as automatic speech recognition [7] and is the key to the conversational (intelligent) user interface, as pointed out in [8, 9] (see also [10] for an introduction to relevant topics such as dialogue modelling, data analysis, dialogue corpus annotation, and annotation tools).
Fig. 1. Applications of spoken dialogue technology towards multimodal dialogue in mobile or ambient intelligence environments
Multimodal dialogue systems allow dialogical inputs and outputs in more than just one modality and go beyond the capabilities of text-based Internet search engines or speech-based dialogue systems. Depending on the specific context, the best input and output modalities can be selected and combined. They belong to
the most advanced intelligent user interfaces [11] in comparison to text-based search engines (an introduction to multimodal dialogue processing can be found in [12]). Some prominent end-to-end multimodal dialogue systems are Janus [13], Verbmobil [14], Galaxy and Darpa Communicator [15–17], Smartkom [18] and SmartWeb [19–21]. Figure 1 shows different input and output modalities of interfaces used in multimodal dialogue systems as applications of multimodal dialogue in mobile or ambient intelligence environments. We will start by explaining the interaction with multimodal dialogue systems according to Julia Freeman’s specious dialogue (specious means apparently good or right though lacking real merit) artwork (http://www.translatingnature.org). In her concept, dialogues are embodiments and consist of pairs of movable, sculptural forms (like physical dialogue agents, cf. the reproduced illustration in Figure 1, left). They can play a multitude of roles such as lurking in corners or shouting at visitors, or “... they will expect to be touched or moved in some way, at the very least they will want to be listened or spoken to” (Freeman). Interestingly, this simple conception leads us to two very dominant roles of multimodal dialogue systems. First, the dominant interaction mode of spoken language, and second, the dominant social role of using haptics and touch to establish a relationship with natural and artificial things. The tradition of spoken dialogue technology (figure 1, upper left) reflects this dominant interaction role. Telephone-based dialogue systems, where the automatic speech recognition (ASR) phase plays the major role, were the first (monomodal) systems which were invented for real application scenarios, e.g., a travel agent speaking by phone with a customer in a specific domain of Airline Travel Planning [22] (the German Competence Center for Language Technology maintains a list of international research projects and available software in this area; Collate is one of these projects, http://collate.dfki.de/index en.html). Speech-based question answering (QA) on mobile telephone devices is a natural extension of telephone-based dialogue systems [23]. Additionally, new smartphones should be able to answer questions about specific domains (e.g., the football domain) in real-time. User: “Who was world champion in 1990?”—System: “Germany.” Other directions (figure 1, down right) are the use of virtual human-like characters and multimodal, ambient intelligence environments where cameras detect human hand and finger gestures [24]. With the recent advent of more powerful mobile devices and APIs (e.g., the iPhone) it is possible to combine the dialogue system on the mobile device with a touchscreen table interaction. Multiple users can organise their information/knowledge space and share information with others, e.g., music files. These interaction modes combine Internet terminals and touchscreen access with mobile physical storage artefacts, i.e., the smartphones, and allow a casual user to search for and exchange music in an intuitive way. The role of intuition will be rendered more precisely in the rest of this chapter. In general, the interaction between the user and the multimodal dialogue system can either be user-initiative, system-initiative, or mixed-initiative. In user-initiative systems, the user has to utter a command before the system
starts processing. Hence, the focus is more or less on the correct interpretation of the user’s utterance; the new user intention, therefore, drives the behaviour of the dialogue system. A typical example can be found in route planning in the transportation domain. For example, User: “I need to get to London in the afternoon”—System: “Ok. Query understood.” The system can react by taking initiative and gathering the necessary information for the decision of which mode of transportation would be preferable. In system-initiative systems, the user and the system must have a collaborative goal, and after initialisation (the user question), the system basically asks for missing information to achieve this goal. In [25], mixed-initiative is defined as: “[...] the phrase to refer broadly to methods that explicitly support an efficient, natural interleaving of contributions by users and automated services aimed at converging on solutions to problems.” This basically means that the sub-goals and commitments have to come from both parties and be cleverly fulfilled and/or negotiated. Why is this distinction in dialogue initiative so important to us for a better understanding of intuitive multimodal dialogue systems? First, we need to interpret others’ actions properly in order to respond or react in an expected and collaborative way. This also means that we must be able to interpret the input consistently in accordance with one’s beliefs and desires. There are active and passive modes of conversation. Especially the passive modes are often unintentional, but we can perceive them instinctively. In the first instance, the dialogue system’s perception of the passive input modes, e.g., a gaze through multimodal sensories, allows it to maintain a model of instinctive dialogue initiative as system-initiative. Second, it is important to point out that the user/system initiative is not only about natural language. Multiple input and output signals in different modalities can be used to convey important information about (common) goals, beliefs, and intentions. In most of our examples, dialogue is in service of collaboration. However, the goal is still to solve a specific problem. Dialogue initiative, and mixed-initiative interaction, arises naturally from the joint intentions and intuitions about how to best address the problem solving task, thereby forming a theory of collaborative activity. The following human-human dialogue illustrates the introduction of a dialogue topic, a topic shift, and a development of a common interest that is pursued as the goal of the conversation.
1. A: “Hey, so are you going to see that new movie with Michael Jackson?” (topic introduction)
2. B: “You mean ‘This Is It’?”
3. A: “Yeah, I think that’s what it’s called.”
4. B: “When is it coming out?”
5. A: “It should be out in about a week. I really hope they also include some of the songs from the Jackson Five years. Do you like them, too?” (topic shift)
6. B: “I really like the Jackson Five! It’s too bad the solo albums from Jermaine Jackson never became that popular.”
7. A: “Exactly! Did any of the other members of the group produce solo albums?” (development of common interest)
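Returning to the route-planning exchange above, the contrast between the initiative styles can be made concrete with a small sketch. The Python fragment below is only illustrative; the slot names, the keyword-based interpreter, and the prompts are our own assumptions and do not correspond to any of the systems cited here. It shows the core of a system-initiative loop: after the user's opening utterance, the system keeps asking for whatever information is still missing to reach the collaborative goal.

```python
# Minimal sketch of system-initiative slot filling (transport domain).
# Slot names and the keyword "parser" are illustrative assumptions only.

REQUIRED_SLOTS = ["destination", "arrival_time", "transport_mode"]

def parse_utterance(utterance):
    """Toy interpreter standing in for real speech understanding."""
    slots = {}
    text = utterance.lower()
    if "london" in text:
        slots["destination"] = "London"
    if "afternoon" in text:
        slots["arrival_time"] = "afternoon"
    if "train" in text:
        slots["transport_mode"] = "train"
    return slots

def next_system_move(filled):
    """System initiative: ask for the first missing slot, otherwise act on the goal."""
    for slot in REQUIRED_SLOTS:
        if slot not in filled:
            return "Which " + slot.replace("_", " ") + " would you like?"
    return ("Searching " + filled["transport_mode"] + " connections to "
            + filled["destination"] + " for the " + filled["arrival_time"] + ".")

state = {}
state.update(parse_utterance("I need to get to London in the afternoon"))
print(next_system_move(state))   # system asks for the missing transport mode
state.update(parse_utterance("By train, please"))
print(next_system_move(state))   # all slots filled: the system acts
```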
Interaction technology and HCIs are extremely popular. However, the technology is still in a stage of infancy, especially when it comes to the dialogue management task and the modelling of interaction behaviour as demonstrated in the
dialogue example. Especially the development of a common interest demands a fine-grained dialogue state model and intuitive capabilities to perceive the common interest. HCIs are also popular in the context of (effective) information retrieval. Advanced user interfaces, such as multimodal dialogue interfaces, should provide new solutions. The Information Retrieval (IR) application domain [26, 27] is well suited for listing the current challenges of multimodal dialogue that a model of intuition should help to address. In the context of information retrieval interaction (also cf. [28]), we often work with systems that:
– Cannot recognise and develop common goals (as demonstrated in the dialogue example);
– Use canned dialogue segments (e.g., “The answer to your query [input x] is [input y].”);
– Use hardwired interaction sequences (no sub-dialogues or clarifications possible);
– Do not use inference services (new information cannot be used to infer new knowledge); and
– Have very limited adaptation possibilities (e.g., no context adaptation is possible).
Combined methods to overcome the limitations include: (1) obeying dialogue constraints, (2) using sensory methods, and (3) modelling self-reflection and adaptation towards implementing intuition (chapter 5). We can work against the limitations of current multimedia and HCI technology by exploiting dialogue systems with special (metacognitive) abilities and interaction agents that can simulate instincts and use intuitive models. We distinguish foraging, vigilance, reproduction, intuition, and learning as the human basic instincts (also cf. [29]). However, foraging and reproduction have no embodiment in contemporary AI for interaction technology and HCIs. Figure 2 shows the dependencies and influences of adaptivity and instincts of a dialogue environment towards implementing intuition in dialogue. Adaptive dialogue might be the first step to overcome the difficulties. A context model influences the adaptive dialogue possibilities. Adaptive dialogue is a precondition for implementing instinctive dialogue since instincts are the triggers to adapt, e.g., the system-initiative. Multimodal sensory input influences instinctive dialogue. Instinctive dialogue is necessary for implementing intuitive dialogue. Instincts provide the input for a higher-order reasoning and intuition model. A self-reflective model/machine learning (ML) model influences intuitive dialogue. The strong relationship between instinctive computing and multimodal dialogue systems should enable us to introduce the notion of intuition into multimodal dialogue. Implementing intuition in dialogue is our main goal, and, as we will explain in the rest of this chapter, instinctive dialogue in the form of an instinctive dialogue initiative is seen as the precondition for implementing intuition.
Fig. 2. Multimodal dialogue system properties: dependencies and influences of adaptivity and instincts of a dialogue environment towards implementing intuition in dialogue
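The layering of Fig. 2 can be read as a small processing pipeline: a context model feeds adaptive dialogue, multimodal sensors trigger instinctive reactions, and a self-reflective (learned) layer adds intuition on top. The sketch below is only a reading aid under our own assumptions; the class names, thresholds, and the simple failure counter are invented for illustration and are not taken from any cited system.

```python
# Illustrative sketch of the dependency chain in Fig. 2.
# All names and thresholds are assumptions made for this sketch.

class ContextModel:
    def __init__(self, asr_confidence=0.9, user_is_novice=True):
        self.asr_confidence = asr_confidence
        self.user_is_novice = user_is_novice

class AdaptiveDialogue:
    """Chooses a dialogue strategy from the context model."""
    def strategy(self, ctx):
        return "system_initiative" if ctx.asr_confidence < 0.5 else "mixed_initiative"

class InstinctiveDialogue:
    """Direct reaction rules triggered by multimodal sensory input."""
    def react(self, sensors):
        if sensors.get("gaze") == "on_view" and sensors.get("speech_detected"):
            return "activate_asr"
        return None

class IntuitiveDialogue:
    """Monitors the lower layers and adapts their behaviour over time."""
    def __init__(self):
        self.failures = 0
    def monitor(self, reaction_succeeded):
        if not reaction_succeeded:
            self.failures += 1
    def control(self, adaptive, ctx):
        if self.failures > 2:          # learned fallback after repeated failures
            ctx.asr_confidence = 0.0   # forces the more constrained strategy
        return adaptive.strategy(ctx)

ctx = ContextModel()
instinct = InstinctiveDialogue()
intuition = IntuitiveDialogue()
print(instinct.react({"gaze": "on_view", "speech_detected": True}))  # activate_asr
print(intuition.control(AdaptiveDialogue(), ctx))                    # mixed_initiative
```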
3 Adaptive Dialogue
Adaptive dialogue systems can handle errors that occur during dialogue processing. For example, if the ASR recognition rate is low (i.e., the user utterances cannot be understood), an adaptive system can proactively change from user/mixed-initiative to system-initiative and ask form-filling questions where the user only responds with single open-domain words like surnames, “Yes”, or “No”. Adaptive dialogue systems allow these changes in dialogue strategies not only based on the progression of the dialogue, but also based on a specific user model. Additionally, specific dialogue obligations constrain the search space of suitable dialogue strategies [30–34]. User modelling and dialogue constraints will be explained in more detail in the following two subsections.
3.1 User Modelling
Incorporation of user models (cf., e.g., a technical user modelling architecture in [35]) helps novice users to complete system interaction more successfully and quickly, since much help information is presented by the system. Expert users, however, do not need this much help to perform daily work tasks. As help information and explicit instructions and confirmations given by the system increase the number of system turns, communication becomes less efficient. Only recently has user-modeling-based performance analysis been investigated. For example, [36] tried to empirically provide a basis for future investigations into whether adaptive system performance can improve by adapting to user uncertainty differently based on the user class, i.e., the user model. Likewise, [37] argue that NLP systems consult user models in order to improve their understanding of users’ requirements and to generate appropriate and relevant responses. However, humans often instinctively know when their dialogue contribution is appropriate (e.g., when they should speak in a formal meeting or reduce the length of their contribution), or what kind of remark would be relevant in a specific dialogue situation.
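A minimal sketch of the adaptation described in this section, under our own assumptions (the confidence threshold, the user classes, and the help texts are illustrative and are not taken from [35–37]): low ASR confidence switches the system to form-filling system initiative, and the user model decides how much help text accompanies a prompt.

```python
# Sketch: strategy selection from ASR confidence plus user-model-dependent help.
# Threshold and user classes are illustrative assumptions.

def choose_strategy(asr_confidence):
    return "system_initiative" if asr_confidence < 0.4 else "mixed_initiative"

def render_prompt(question, user_class):
    help_text = {
        "novice": " (You can answer with a single word, e.g. a surname, 'yes' or 'no'.)",
        "expert": "",
    }
    return question + help_text.get(user_class, "")

if choose_strategy(asr_confidence=0.3) == "system_initiative":
    print(render_prompt("Please state the destination city.", user_class="novice"))
```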
3.2 Dialogue Constraints and Obligations
Dialogue constraints subsume four constraint types: linguistic constraints (e.g., correct case and number generation), correct dialogue acts as system responses (cf. adjacency pairs, answers should follow user questions, for example), timing constraints, and constraints on the information content itself (e.g., information to be presented should generally follow Grice’s maxims and the users’ presumptions about utterances; information should be available in an appropriate quantity). In a dialogue, participants are governed by beliefs, desires, intentions (BDI), but also obligations. Beliefs represent the current state of information and are what a dialogue participant believes in terms of the other dialogue participant(s) and her world in general. Desires are what the dialogue participant would like to accomplish; they represent her source of motivation and are rather general. It is possible to have two or more desires which are not possible simultaneously (for example, wanting to go to Bombay and Paris on the next vacation). Intentions describe what the dialogue participant has chosen to do. At this point, the dialogue participant has already decided on a desire to pursue. Obligations are the results of (largely social) rules or norms by which a dialogue participant lives. They play a central role when understanding and creating dialogue acts for a natural dialogue because they account for many of the next user or system moves during the dialogical interaction. Usually, a general cooperation between dialogue participants is assumed based on their intentions and common goals. However, this assumption fails to acknowledge a not-infrequent uncooperativeness in respect to shared conversational goals. Recognising this dialogue behaviour in other dialogue participants can help develop more effective dialogues toward intuition in dialogue by changing the dialogue strategy. Obligations are induced by a set of social conventions. For instance, when a dialogue participant asks a question, the obligation is to respond. It is therefore relevant how obligations can be identified, which rules can be developed out of this knowledge, and how the system can properly respond. For example, when a participant uses the interjection “uh” three times, the system responds by helping the user, or when a participant speaks, the system does not interrupt. We can say that the system adapts to the user model while at the same time obeying the dialogue obligations. (Also see [38] for a list of social discourse obligations.) The following lists give examples for important social obligations we encounter in terms of instinctive and intuitive dialogue when assuming that the conversational goal supports the task of the user (also cf. [34], page 149ff.). The core social discourse obligations are: 1. Self-introduction and salutation: “Hi there.” “Hello.” “Good morning/ evening.” “Hi. How can I help you?” “What can I do for you?” “Hi [name], how are you today?” 2. Apology: “I’m so sorry.” “I’m sorry.” “It seems I’ve made an error. I apologise.” 3. Gratitude: “Thank you so much.” “I appreciate it.” “Thank you.” 4. Stalling and pausing: “Give me a moment, please.” “One minute.” “Hang on a second.”
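Obligations of this kind lend themselves to simple condition-action rules. The sketch below is a hedged illustration only: the event representation, the rule order, and the canned responses are our own assumptions and are not part of the obligation models in [34] or [38].

```python
# Sketch of discourse obligations as condition-action rules.
# Event fields, rule order, and responses are illustrative assumptions.

OBLIGATION_RULES = [
    (lambda ev: ev.get("hesitations", 0) >= 3, "It seems you are unsure. May I offer some help?"),
    (lambda ev: ev.get("type") == "greeting",  "Hi. How can I help you?"),
    (lambda ev: ev.get("type") == "gratitude", "You are welcome."),
    (lambda ev: ev.get("type") == "question",  "ANSWER_OR_APOLOGISE"),
    (lambda ev: ev.get("type") == "user_speaking", "HOLD_FLOOR"),  # do not interrupt
]

def obliged_response(event):
    for condition, response in OBLIGATION_RULES:
        if condition(event):
            return response
    return None

print(obliged_response({"type": "question", "hesitations": 3}))  # the help obligation fires first
```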
In the context of information-seeking multimodal dialogue, extra-linguistic inputs/outputs (i.e., anything in the world outside language and the language modality, but which is relevant to the multimodal utterance) and social obligations have particular multimodal implementations (extra-linguistic universals in communication and language use are also described in the context of politeness; see [39]):
1. Maintaining the user’s attention by reporting on the question processing status (+ special gestures in case of embodied agents, e.g., looking into the others’ eyes; + special mimics). For example, if a process is successful, embodied agents can express joy. If the query cannot be processed satisfactorily, the linguistic output can be accompanied with expressions of shame (figure 3):
– “Well, it looks like this may take a few minutes.”
– “I’m almost done finding the answer to your question.”
– “Give me a moment and I’ll try to come up with something for you.”
– “I’ll have your answer ready in a minute.”
– “I am getting a lot of results so be patient with me.”
2. Informing the user about the probability of query success, i.e., the probability that the user is presented with the desired information, or informing the user as to why the current answering process is due to fail (+ special gestures such as shaking one’s head or shrugging one’s shoulders):
Fig. 3. Emotional expressions in embodied virtual agents. These examples are taken from the VirtualHumans project, see http://www.virtual-human.org/ for more information.
– “There is a good chance that I will be able to come up with a number of possibilities/solutions/answers for you.”
– “I will probably get a lot of results. Maybe we can narrow them down already?”
– “Sorry, but apparently there are no results for this enquiry. I can try something else if you like.”
– “This search is taking longer than it should. I probably won’t be able to come up with an answer.”
– “It seems that I can only give you an answer if I use a special service/program. This is not a free service, though.”
3. Balancing the user and system initiative (equipollent partners should have the same proportion of speech in which they can point something out).
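The status-reporting obligation in item 1 and the success-probability obligation in item 2 can be approximated by a small decision function. The thresholds below are illustrative assumptions standing in for estimates a real system would obtain from its retrieval components; the canned utterances are taken from the lists above.

```python
# Sketch: choose a status report from an estimated answer time and success probability.
# The thresholds are illustrative assumptions; the utterances come from the lists above.

def status_report(expected_seconds, success_probability):
    if success_probability < 0.2:
        return ("Sorry, but apparently there are no results for this enquiry. "
                "I can try something else if you like.")
    if expected_seconds > 60:
        return "Well, it looks like this may take a few minutes."
    if expected_seconds > 10:
        return "I am getting a lot of results so be patient with me."
    return "I'll have your answer ready in a minute."

print(status_report(expected_seconds=12, success_probability=0.8))
```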
4 Instinctive Dialogue
We hypothesise that underlying maxims of conversation and the resulting multimodal dialogue constraints may very much be related to instinctive computing [29]. The instincts discussed here are vigilance, learning, and intuition, whereby intuition is considered a cognitively more advanced form of intelligence which builds on the input of the other instincts, vigilance and learning. Human-centred human-computer interaction strategies are applied to enable computers to unobtrusively respond to the user-perceived content. These strategies can be based on instinctive reactions which take vigilance and learning into account. We attempt to shed light on the relationship between instinctive computing and state-of-the-art multimodal dialogue systems in order to overcome the limitations of contemporary HCI technology. Instincts relate to dialogue constraints (as explained in chapter 3), and the attempts to convey them, in order to make HCIs more intelligent. Linguistic features (syntax, morphology, phonetics, graphemics), para-linguistic features (tone, volume, pitch, speed, affective aspects), and extra-linguistic features (haptics, proxemics, kinesics, olfactics, chronemics) can be used to model the system state and the user state (e.g., emotional state or stress level). Multimodal sensory input recognition, and the recognised affective states (e.g., laughing) and emotions, play the central roles in instinctive dialogue. (Proxemics is “the branch of knowledge that deals with the amount of space that people feel is necessary to set between themselves and others”, New Oxford Dictionary of English.)
4.1 Multimodal Sensory Input
Sensors convert a physical signal (e.g., spoken language, gaze) to an electrical one that can be manipulated symbolically within a computer. This means we interpret the spoken language by ASR into text symbols, or a specific gaze expression into an ontology-based description, e.g., the moods exuberant and bored. An ontology is a specification of a conceptualisation [40] and provides the symbols/names for the input states we try to distinguish. As mentioned before, there are passive and active sensory inputs. The passive input modes, such as anxiety or indulgence in a facial expression, roughly correspond to the ones perceived instinctively. (This is also consistent with our definition of intuition since many passive sensory input modes are not consciously
perceived by the user.) The active input modes, on the other hand, are mostly the linguistic features such as language syntax and morphology. These convey the proposition of a sentence (in speech theory, this means the content of a sentence, i.e., what it expresses in the specific context). The passive input modes are, however, more relevant for modelling users’ natural multimodal communication patterns, e.g., variation in speech and pen pressure [33]. That is, users often engage in hyperarticulate speech with computers (as they would if talking to a deaf person). Because they expect computers to be error-prone, durational and articulatory effects can be detected by ASR components (also cf. [41]). Multimodal interaction scenarios and user interfaces may comprise many different sensory inputs. For example, speech can be recorded by a Bluetooth microphone and sent to an automatic speech recogniser; camera signals can be used to capture facial expressions; the user state can be extracted using biosignal input, in order to interpret the user’s current stress level (e.g., detectors measuring the levels of perspiration). Stress level corresponds to an instinctive preliminary estimate of a dialogue participant’s emotional state (e.g., anger vs. joy). In addition, several other sensory methods can be used to determine a dialogue’s situational and discourse context—all of which can be seen as an instinctive sensory input. First, the attention detection detects the current focus of the user by using onview/off-view classifiers. If you are addressed with the eyes in, e.g., a multi-party conversation, you are more vigilant and aware that you will be the next to take over the dialogue initiative. Therefore you listen more closely; a computer may activate the ASR (figure 4, right).
Fig. 4. Anthropocentric thumb sensory input on mobile touchscreen (left) and two still images illustrating the function of the on-view/off-view classifier (right)
Passive sensory input, e.g., gaze, still has to be adequately modelled. People frequently direct their gaze at a computer when talking to a human peer. On the other hand, while using mobile devices, users can be trained to direct their gaze toward the interaction device (e.g., by providing direct feedback for the user utterance). This can enhance the usability of an interaction device enormously when using an open-microphone engagement technique with gaze direction input. Although good results have been obtained for the continuous listening for unconstrained spoken dialogue [42], passive multimodal sensory inputs offer additional triggers to direct speech recognition and interpretation.
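How such passive triggers might be combined can be sketched as a small fusion rule. The feature names and thresholds below are our own assumptions; a real system would obtain the on-view score from a classifier like the one in figure 4 and the energy value from its audio front end.

```python
# Sketch of an instinctive trigger for open-microphone engagement: combine
# on-view gaze classification, an optional thumb touch, and speech energy.
# Feature names and thresholds are illustrative assumptions.

def should_activate_asr(gaze_on_view_score, thumb_touch, speech_energy):
    addressed_by_gaze = gaze_on_view_score > 0.7       # user looks at the device
    speaking = speech_energy > 0.3                     # above an assumed noise floor
    return thumb_touch or (addressed_by_gaze and speaking)

print(should_activate_asr(gaze_on_view_score=0.85, thumb_touch=False, speech_energy=0.5))  # True
print(should_activate_asr(gaze_on_view_score=0.20, thumb_touch=False, speech_energy=0.5))  # False
```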
In addition, people have limited awareness of the changes they make when addressing different interlocutors, e.g., changes in amplitude are not actively perceived. On the other hand, the addressee of the message may be very sensitive to changes in amplitude he perceives. For example, think of how some people’s voices change when they speak to small children or their spouses. ASR components can be trimmed to emphasis detection. An important theoretical framework that the research in, e.g., [33] builds on is that the hypo-to-hyper emphasis spectrum is characteristic not only for speech (e.g., hyper-articulation and hyper-amplitude), but for all modes of communication. In particular, a test can be constructed around whether an utterance was intended as a request to the computer. The result would be an instinctive addressee sensor the multimodal dialogue system could use as passive trigger for, e.g., system initiative.
4.2 Recognising Moods and Emotions
Feelings and emotions have been discussed in relevant psychological literature, e.g., [44]. More modern, technically grounded works speak of moods and emotions in the mind, e.g., [45]. We use the distinction between moods and emotions (figure 5). The realisation of emotions (in embodied virtual characters, such as the ones in figure 8, right) for speech and body graphics has been studied, e.g., in [46]. [43] used more fine-grained rules to realise emotions by mimicry, gesture, and face texture. Moods are realised as body animations of posture and gestures. Affective dialogue systems have been described in [47] (important related work comprises [48], who outlines a general emotion-based theory of temperament; [49] deals with the cognitive structure of emotions; and [50] introduces a five-factor model of personality and its applications).
Fig. 5. Moods and emotions according to [43]
Computational models for emotions are explained in [51] and bring together common-sense thinking and AI methods. In multimodal dialogue, [52] discusses emotion-sensing in life-like communication agents and [53] describes the history of the automatic recognition of emotions in speech while presenting acoustic and linguistic features used. There are many interesting new projects for this purpose, too: Prosody for Dialog Systems (http://www.speech.sri.com/projects/dialog-prosody) investigates the use of prosody, i.e., the rhythm and melody of speech in voice input, to human-computer dialogue systems. Another project, HUMAINE (http://emotion-research.net), aims to develop systems that can register, model, and/or influence human emotional and emotion-related states and processes (it works towards emotion-oriented interaction systems). EMBOTS (http://embots.dfki.de) create realistic animations of nonverbal behaviour such as gesture, gaze, and posture. In the context of corpus-based speech analysis, [54] find a strong dependency between recognition problems in the previous turn (a turn is what a dialogue partner says until another dialogue partner rises to speak) and user emotion in the current turn; after a system rejection there are more emotional user turns than expected. [55] address a topic and scenario that we will use to formulate intuitive dialogue responses in an example dialogue, i.e., the modelling of student uncertainty in order to improve performance metrics including student learning, persistence, and system usability. One aspect of intuitive dialogue-based interfaces is that one of the goals is to tailor them to individual users. The dialogue system adapts (intuitively) to a specific user model according to the sensory input it gets.
4.3 Towards Intuition
When it comes to implementing intuition, we expect an instinctive interaction agent to deliver the appropriate sensory input. A useful and cooperative dialogue in natural language would not only combine different topics, heterogeneous information sources, and user feedback, but also intuitive (meta) dialogue—initiated by the instinctive interaction agent. Many competences for obeying dialogue constraints fall into the gray zone between competences that derive from instincts or intuition. The following list enumerates some related aspects:
– As mentioned before, instinctive and intuitive dialogue interfaces should tailor themselves to individual users. The dialogue system adapts intuitively to a specific user model according to the sensory input it gets instinctively.
– Different addressees can often be separated intuitively, not only by enhancing the intelligibility of the spoken utterances, but by identifying an intended addressee (e.g., to increase amplitude for distant interlocutors). In this context, gaze represents a pivotal sensory input. [56] describes an improved classification of gaze behaviour relative to the simple classifier “the addressee is where the eye is.”
– Intuitive dialogue means using implicit user communication cues (meaning no explicit instruction has come from the user). In this case, intuition can be defined as an estimation of a user-centred threshold for detecting when the system is addressed in order to (1) automatically engage, (2) process the request that follows the ASR output in the dialogue system, and (3) finally respond. – With anthropocentric interaction design and models, we seek to build input devices that can be used intuitively. We recognised that the thumb plays a significant role in modern society, becoming humans’ dominant haptic interactor, i.e., main mode of haptic input. This development should be reflected in the interface design for future HCIs. Whether society-based interaction habits (e.g., you subconsciously decide to press a doorbell with your thumb) can be called an instinctive way of interacting, is just one aspect of the debate about the relationship between intuition and instincts (figure 4, left). The combination of thumb sensory input and on-view/off-view recognition to trigger ASR activation is very intuitive for the user and reflects instinctive capabilities of the dialogue system toward intuition capabilities. – Intuitive question feedback (figure 6), i.e., graphical layout implementation for user queries, is a final aspect (generation aspect) of instinctive and intuitive dialogue-based interfaces. The question “Who was world champion in 1990?” results in the augmented paraphrase Search for: World champion team or country in the year 1990 in the sport of football, division men. Concept icons (i.e., icons for the concepts cinema, book, goal, or football match, etc.) present feedback demonstrating question understanding (a team instance is being asked for) and answer presentation in a language-independent, intuitive way. The sun icon, for example, additionally complements a textual weather forecast result and conveys weather condition information. Intuition can be seen as instinctive dialogue. Intuition can also be seen as cognitively advanced instincts.
Fig. 6. Intuitive question feedback
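The generation side of such feedback can be sketched as a simple mapping from the interpreted question to an augmented paraphrase plus a concept icon. The icon file names and answer-type labels below are illustrative assumptions; they merely mirror the examples given above (team/goal, weather/sun).

```python
# Sketch of intuitive question feedback: paraphrase plus language-independent
# concept icons. Icon file names and answer-type labels are assumptions.

CONCEPT_ICONS = {"Team": "goal.png", "Weather": "sun.png", "Film": "cinema.png", "Book": "book.png"}

def question_feedback(answer_type, paraphrase):
    return {
        "paraphrase": "Search for: " + paraphrase,
        "icon": CONCEPT_ICONS.get(answer_type, "generic.png"),
    }

print(question_feedback(
    "Team",
    "World champion team or country in the year 1990 in the sport of football, division men"))
```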
5 Intuitive Dialogue
The definition of intuition we introduced—the power of obtaining knowledge that cannot be acquired either by inference or observation, by reason or experience—is unsustainable in the context of technical AI systems. The previous sections have given examples of intuitive behaviour only made possible by inference or observation. Instead we say that intuition is based on inference or observation, by reason or experience, but the process happens unconsciously in humans. This gives us the freedom to use technical (meta) cognitive architectures towards technical models of intuition, based on perception of moods and emotions in addition to the language in a multimodal dialogue system. Mood recognition has mostly been studied in the context of gestures and postures (see, e.g., [57]). Boredom has visual gesture indicators, e.g., looking at the watch, yawning (also cf. “a bit of a yawn”, and putting the hand to the mouth), the posture of a buckled upper part of the body, and slouchy or slow basic motions. However, the mimicry of boredom is much more difficult to describe than, e.g., the mimicry of the basic emotions (figure 3). This also means that automatic facial expression methods (e.g., [58], who use robust facial expression recognition from face video) have difficulties in detecting this even when analysing the temporal behaviour of the facial muscle units. In [59], 3D wireframe face models were used to discriminate happiness from anger and occluded faces. However, more complex expressions as in figure 7 cannot be detected with the required accuracy (see [60] for more facial expressions and emotions, and other non-verbal, extra-linguistic behavioural signals). The recognition on a 2D surface is even more difficult. How come humans are able to easily detect eagerness and boredom in watercoloured coal drawings? Especially the correct interpretation of these non-verbal behavioural signals is paramount for an intuitive reaction in multimodal dialogue, e.g., a proposition for personal recreational activities (based on context information and a user profile; see figure 8, left). The following dialogue example (adapted and extended from [46]) exemplifies intuitive reaction behaviour of the host (H) as a result of his intuitive perception of the emotions and moods of the players Mr. Kaiser and Ms. Scherer in the football game (figure 8, right).
1. H: “Now, pay attention [points to paused video behind them] What will happen next? One—Ballack will score, Two—the goalie makes a save, or Three—Ballack misses the shot?”
2. H: “What do you think, Mr. Kaiser?”
3. K: “I think Ballack’s going to score!” (Mr. Kaiser interprets the host’s question as something positive, since he was asked and not Ms. Scherer.)
4. H: “Spoken like a true soccer coach.” (The host intuitively perceives the certainty in Mr. Kaiser’s reaction. According to [61], whose related context is tutoring dialogues for students, the host perceives an InonU (incorrect and not uncertain) or CnonU (correct and not uncertain) reaction.)
5. H: “Then let’s take a look and see what happens!” (All turn and watch the screen. Ballack shoots but misses the goal.)
6. H: “Well, it seems you were really wrong!”
7. H: “But, don’t worry Mr. Kaiser, this isn’t over yet. Remember, you still have two more chances to guess correctly!”
8. K: “This time, um, I guess the goalie will make a save.”
9. H: “Ms. Scherer, do you agree with Mr. Kaiser this time?” (active violation of the one player-host question-answer pattern)
10. S: “Oh, uh, I guess so.”
11. H: “Great! Well, then let’s see what happens this time!”
Fig. 7. (Left) Eager demand for a reaction. (Right) No resources against boredom.
Intuition in dialogue systems also has a strong active component. It should help the system react in an appropriate way, e.g., to avoid distracting the user such that the cognitive load remains low and the user can still focus on the primary task, or, to motivate, convince, or persuade the user to develop an idea further and/or pursue mutual goals.
(Notes to dialogue turns 6–11: The host recognises the InonU case. Mr. Kaiser looks down, is very disappointed (and knows he should not have been that cheeky). Ms. Scherer looks pleased. The host recognises that Mr. Kaiser is disappointed and that Ms. Scherer is mischievous. The host notices that Mr. Kaiser is now more moderate and that Ms. Scherer is not paying attention anymore and tries to include her as well. Ms. Scherer realises she has not been paying attention and is embarrassed about having been caught. Immediately, she is also relieved, though, since she realises that the host is not trying to make this evident but bring her focus back to the game. The host knows the focus of both players is back and goes on with the game.)
Fig. 8. (Left) Intuitive soccer dialogue through context information with the help of a mobile dialogue system. (Right) Dialogue with virtual characters in a virtual studio (reprinted with permission [46]). In the mobile situation, intuition is needed to understand the intentions of the user (perception). In the studio, intuition can lead to emotional expression and emphatic group behaviour (generation).
Techniques for information fusion are at the heart of intuitive dialogue design. Humans are capable of correctly fusing different information streams. For example, think of the combination of spoken language and visual scene analysis you have to perform when a passenger explains the route to you while you are driving. People can automatically adapt to an interlocutor’s dominant integration pattern (parallel or sequential). This means, e.g., while explaining the route to you, your friend points to a street and says “This way!”; he can do this in a parallel or sequential output integration pattern and you adapt accordingly. Whereas the on-focus/off-focus detection alone (figure 4, right) cannot really be seen as intuitive, the combination of automatically-triggered speech recognition with deictic reference fusion (“Is this (+ deictic pointing gesture) really the correct way?”) and gesture recognition for dissatisfaction/unhappiness actually can be seen as intuitive. The system can perceive dissatisfaction and suggest a query reformulation step or other query processing adaptations.
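A common way to realise such deictic reference fusion is late fusion within a time window: a pointing gesture is accepted as the referent of “this” if it occurs close enough in time to the spoken deictic expression. The sketch below is an illustration under our own assumptions (window size, event layout); it is not the fusion component of any of the systems cited here.

```python
# Sketch of late fusion of speech and a deictic pointing gesture.
# Window size and the event layout are illustrative assumptions.

def fuse_deictic(utterance, gestures, window=1.5):
    """utterance: {'text': str, 'time': float}; gestures: list of {'object', 'time'}."""
    if "this" not in utterance["text"].lower():
        return None
    candidates = [g for g in gestures if abs(g["time"] - utterance["time"]) <= window]
    # Prefer the gesture closest in time to the spoken deictic expression.
    return min(candidates, key=lambda g: abs(g["time"] - utterance["time"]), default=None)

referent = fuse_deictic({"text": "Is this really the correct way?", "time": 10.2},
                        [{"object": "street_12", "time": 9.8}])
print(referent)   # {'object': 'street_12', 'time': 9.8}
```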
5.1 Self-reflective Model
We believe that a self-reflective model is what accounts for intuitive behaviour. Psychological literature provides the necessary concept of such a self-reflective model. Therefore, we will introduce self-reflection by applying the general processing framework of metacognition by [62]. In their theoretical metamemory framework, they base their analysis of metacognition on three principles:
1. The cognitive processes are split into two or more interrelated levels.
2. The meta-level (metacognition) contains a dynamic model of the object-level (cognition).
3. The two dominant relations between the levels are called control and monitoring.
The basic conceptual architecture consists of two levels, the object-level and the meta-level (figure 9). We will use the two-level structure to model a self-reflective model for implementing intuition. (Specialisations to more than two levels have also been developed, see [63]). This architecture extends approaches to multi-strategy dialogue management, see [64] for example. The main point is that we can maintain a model of the dialogue environment on the meta-level which contains the context model, gets access to the multimodal sensory input, and also contains self-reflective information. We called such a model an introspective view, see figure 10. An introspective view emerges from monitoring the object level—the correct interpretation of available sensory inputs and the combination of this information with prior knowledge and experiences. We can use this knowledge to implement a dialogue reaction behaviour (cf. [34], p. 194f) that can be called intuitive since the object level control fulfills intuitive functions, i.e., the initiation, maintenance, or termination of object-level cognitive activities; our intuition controls our object-level behaviour by formulating dialogue goals and triggering dialogue actions.
Fig. 9. The self-reflective mechanism consists of two structures, (1) the object-level, and (2) the meta-level, whereby 1 → 2 is an asymmetric relation monitoring, and 2 → 1 is an asymmetric relation control. Both form the information flow between the two levels; monitoring informs the meta-level and allows the meta-level to be updated. Depending on the meta-level, the object-level is controlled, i.e., to initiate, maintain, or terminate object-level cognitive activities like information retrieval or other dialogue actions.
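A compact way to read Fig. 9 is as a monitoring-control loop: the object level executes a cognitive activity and reports on it; the meta-level updates its dynamic model from those reports and, if necessary, terminates or re-initiates the activity. The sketch below is a toy rendering under our own assumptions; the activity names and the failure heuristic are invented for illustration.

```python
# Toy sketch of the object-level / meta-level loop of Fig. 9.
# Activity names and the control heuristic are illustrative assumptions.

class ObjectLevel:
    def __init__(self):
        self.activity = None
    def start(self, activity):
        self.activity = activity
    def step(self):
        # Stand-in report for a real retrieval activity.
        return {"activity": self.activity, "results": 0, "elapsed": 4.0}

class MetaLevel:
    """Maintains a dynamic model of the object-level and issues control actions."""
    def __init__(self):
        self.model = {}
    def monitor(self, report):
        self.model.update(report)            # monitoring: information flows upward
    def control(self, obj):
        # Control: terminate a fruitless activity and initiate another one.
        if self.model.get("results", 0) == 0 and self.model.get("elapsed", 0.0) > 3.0:
            obj.start("query_internet")
        # otherwise: maintain the current activity

obj, meta = ObjectLevel(), MetaLevel()
obj.start("query_knowledge_base")
meta.monitor(obj.step())
meta.control(obj)
print(obj.activity)   # query_internet
```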
Machine Learning Model. What is worth learning in the context of intuition? For example, learning about the adaptation to differences in user integration patterns (i.e., not waiting for multimodal input when the input is unimodal would be beneficial). The understanding of the user’s (temporal) integration patterns of multiple input modalities can be seen as intuitively understanding how to fuse the passive sensory input modes of a user. This can be seen as a
machine learning classification task. The problem identification step (diagnosis) will be supported by data mining models which are learned by mining the process data log files which will have been obtained by running the baseline system. Unsupervised association rule generation can be employed to identify rules of failure and success according to the item sets derived from processing metadata (for more information, see [34], pp. 179–208).
Fig. 10. Introspective View. Several processing resources (PRs), such as sensories, are connected to a central dialogue system hub. The iHUB interprets PR inputs and outputs on a meta level.
Methodology to Implement Intuition. We will briefly discuss a new methodology of system improvements to ensure future performance for implementing intuition in dialogue systems. The methodology and methods for self-reflection in dialogue systems consist of the following data and model resources (delivering ML input data or ML models):
– Cognitive Ground: The theory assumes a baseline functional dialogue reaction and presentation manager. Such a manager has been developed in [34]. Possible system improvements are identified by reaction utility shortcomings that become obvious by running the baseline finite state-based dialogue system in dialogue sessions with real users. According to this empirical dialogue system evaluation, a subset of the theoretical reaction constraints and utilities (cf. dialogue constraints implementing intuition) can be identified by the dialogue system experts and is improved by the supplied ML methods.
– Information States: Information states will be implemented by a local and a global state. Global states hold information about the current dialogue session and user profiles/models. The local state contains information about the current turn and the changes in the system or user model. – Feature Extraction: In the feature extraction phase, we extract features for the different empirical machine learning models. Algorithm performances on different feature subsets provide a posteriori evidence for the usefulness of individual features. – Associations: Domain knowledge is often declarative, whereas control knowledge is more operational, which means that it can change over time or can only be correctly modelled a posteriori. Associations bridge the gap between cognition towards metacognition and intuitions. The cognitive ground can be used to induce dynamic associations for adaptation purposes. Implemented Dialogue Feedback Example. Intuitive question feedback by using concept icons has been shown in figure 6. This example showed a multimodal speech-based question answering (QA) dialogue on mobile telephone devices [23]. When human users intuitively adapt to their dialogue partners, they try to make the conversation informative and relevant. We can also assume that they avoid saying falsehoods or that they indicate a lack of adequate evidence. In order to learn similar intuitive dialogical interaction capabilities for question answering applications, we used the aforementioned methodology to implement intuition in information-seeking dialogues. In the information and knowledge retrieval context, information sources may change their quality characteristics, e.g., accessibility, response time, and reliability. Therefore, we implemented an introspective view on the processing workflow: machine learning methods update the reasoning process for dialogue decisions in order to allow the dialogue system to provide intuitive dialogue feedback. More precisely, we tried to incorporate intuitive meta dialogue when interpreting the user question and addressing heterogeneous information sources. For example, we ran a baseline system, recorded the current state of the dialogue system, extracted features according to the introspective view/sensory input, and tried to generalise the found associations to knowledge that allows for more intuitive system reactions [34]. The associations we extracted basically revealed which types of questions we are able to answer with the current databases, how long it might take to answer a specific request, and how reliable an answer from an open-domain search engine might be. We could say that the system has an intuition of the probability of success or failure. This intuition (perceived/learned model about the dialogue environment) can be used to provide more intuitive question feedback of the forms: “My intuition says that... I cannot find it in my knowledge base” (stream/answer time prediction); “... I should better search the Internet for a suitable answer” (database prediction); or “... empty results are not expected, but the results won’t be entirely certain,” (answer prediction). A combination of the last two predictive models even allows a dialogue system to intuitively rely
on specific open domain QA results (cf. the dialogue fragment further down). The Answer type Person is predicted to be highly confident for open-domain QA, so the system provides a short answer to the Person question instead of a list of documents as answer.
1. U: “Who is the German Chancellor?”
2. S: “Who is the German Chancellor?” (intuitively repeats a difficult question)
3. S: “I will search the Internet for a suitable answer.”
4. S: “Angela Merkel.” (intuitively relies on the first entry in the result set)
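The kind of learned association behind this behaviour can be sketched as a small lookup that a real system would fill from mined log statistics. The table values, the source names, and the confidence threshold below are our own assumptions; they only illustrate how a predicted answer type can steer source selection, presentation form, and the meta-dialogue feedback quoted above.

```python
# Sketch of learned associations steering intuitive feedback.
# The table stands in for statistics mined from session logs; its values,
# the source names, and the threshold are illustrative assumptions.

LEARNED_ASSOCIATIONS = {
    # answer type: (preferred source, confidence that a short answer will succeed)
    "Person": ("internet", 0.9),
    "Team": ("knowledge_base", 0.8),
    "Definition": ("internet", 0.4),
}

def intuitive_plan(answer_type):
    source, confidence = LEARNED_ASSOCIATIONS.get(answer_type, ("internet", 0.3))
    return {
        "source": source,
        "presentation": "short_answer" if confidence > 0.7 else "document_list",
        "feedback": ("I will search the Internet for a suitable answer."
                     if source == "internet"
                     else "Let me check my knowledge base first."),
    }

print(intuitive_plan("Person"))
# {'source': 'internet', 'presentation': 'short_answer',
#  'feedback': 'I will search the Internet for a suitable answer.'}
```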
Although the exemplified intuitive behaviour is by far less complicated, or intuitive, than the host’s behaviour in the envisioned example of the intuitive dialogue section, we think that the self-reflective model is an important step towards further achievements in the area of implementations of intuition in multimodal dialogue.
6 Conclusion
Intelligent user interfaces have to become more human-centred. This is not only because state-of-the-art HCIs still lag far behind human interaction capabilities, but also because the amount of information to be processed is constantly growing. This makes the automatic selection of suitable information more complicated, and personalised and adapted user interfaces more valuable. [65] argue that services (e.g., for information retrieval) should be organised in a service-oriented architecture that enables self-organisation of ambient services in order to support the users’ activities and goals. User interfaces that instinctively take initiative during multimodal dialogue to achieve a common goal, which is negotiated with the user, provide new opportunities and research directions. Intuitive dialogue goes one step further. It not only takes the multimodal sensory input spaces into account to trigger instinctive dialogue reaction rules, but also allows for maintaining a self-reflective model. This model can evolve over time with the help of learned action rules. If a model is updated and applied unconsciously, we can speak of modelling and implementing intuition in multimodal dialogue. One of the major questions for further debates would be whether appropriate intelligent behaviour in special situations, such as intuition in natural multimodal dialogue situations, is more rooted in past experience (as argued here in the context of intuition and learning) than in logical deduction or other relatives in planning and logical reasoning. Acknowledgements. This research has been supported in part by the THESEUS Programme in the Core Technology Cluster WP4, which is funded by the German Federal Ministry of Economics and Technology (01MQ07016). The responsibility for this publication lies with the author. I would like to thank Colette Weihrauch for the help in the formulation of dialogue constraints and related example dialogues. Thanks to Ingrid Zukerman for useful conversations.
References
1. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, 2nd edn. Prentice-Hall, Englewood Cliffs (2003)
2. Mitchell, T.M.: Machine Learning. McGraw-Hill International Edit, New York (1997)
3. Maybury, M., Stock, O., Wahlster, W.: Intelligent interactive entertainment grand challenges. IEEE Intelligent Systems 21(5), 14–18 (2006)
4. Maybury, M.T.: Planning multimedia explanations using communicative acts, pp. 59–74. American Association for Artificial Intelligence, Menlo Park (1993)
5. Sonntag, D.: Introspection and adaptable model integration for dialogue-based question answering. In: Proceedings of the Twenty-first International Joint Conference on Artificial Intelligence, IJCAI (2009)
6. Cole, R.A., Mariani, J., Uszkoreit, H., Varile, G., Zaenen, A., Zue, V., Zampolli, A. (eds.): Survey of the State of the Art in Human Language Technology. Cambridge University Press and Giardini, New York (1997)
7. Jelinek, F.: Statistical Methods for Speech Recognition (Language, Speech, and Communication). The MIT Press, Cambridge (January 1998)
8. McTear, M.F.: Spoken dialogue technology: Enabling the conversational user interface. ACM Computing Survey 34(1), 90–169 (2002)
9. McTear, M.: Spoken Dialogue Technology. Springer, Berlin (2004)
10. Dybkjaer, L., Minker, W. (eds.): Recent Trends in Discourse and Dialogue. Text, Speech and Language Technology, vol. 39. Springer, Dordrecht (2008)
11. Maybury, M., Wahlster, W. (eds.): Intelligent User Interfaces. Morgan Kaufmann, San Francisco (1998)
12. van Kuppevelt, J., Dybkjaer, L., Bernsen, N.O.: Advances in Natural Multimodal Dialogue Systems (Text, Speech and Language Technology). Springer-Verlag New York, Inc., Secaucus (2007)
13. Woszczyna, M., Aoki-Waibel, N., Buo, F., Coccaro, N., Horiguchi, K., Kemp, T., Lavie, A., McNair, A., Polzin, T., Rogina, I., Rose, C., Schultz, T., Suhm, B., Tomita, M., Waibel, A.: Janus 1993: towards spontaneous speech translation. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 345–348 (1994)
14. Wahlster, W. (ed.): VERBMOBIL: Foundations of Speech-to-Speech Translation. Springer, Heidelberg (2000)
15. Seneff, S., Hurley, E., Lau, R., Pao, C., Schmid, P., Zue, V.: Galaxy-II: A reference architecture for conversational system development. In: Proceedings of ICSLP 1998, vol. 3, pp. 931–934 (1998)
16. Walker, M.A., Passonneau, R.J., Boland, J.E.: Quantitative and Qualitative Evaluation of the Darpa Communicator Spoken Dialogue Systems. In: Meeting of the Association for Computational Linguistics, pp. 515–522 (2001)
17. Walker, M.A., Rudnicky, A., Prasad, R., Aberdeen, J., Bratt, E.O., Garofolo, J., Hastie, H., Le, A., Pellom, B., Potamianos, A., Passonneau, R., Roukos, S.S.G., Seneff, S., Stallard, D.: Darpa communicator: Cross-system results for the 2001 evaluation. In: Proceedings of ICSLP, pp. 269–272 (2002)
18. Wahlster, W. (ed.): SmartKom: Foundations of Multimodal Dialogue Systems. Springer, Berlin (2006)
19. Wahlster, W.: SmartWeb: Mobile Applications of the Semantic Web. In: Dadam, P., Reichert, M. (eds.) GI Jahrestagung 2004, pp. 26–27. Springer, Heidelberg (2004)
20. Reithinger, N., Bergweiler, S., Engel, R., Herzog, G., Pfleger, N., Romanelli, M., Sonntag, D.: A Look Under the Hood—Design and Development of the First SmartWeb System Demonstrator. In: Proceedings of the 7th International Conference on Multimodal Interfaces (ICMI), Trento, Italy (2005)
21. Sonntag, D., Engel, R., Herzog, G., Pfalzgraf, A., Pfleger, N., Romanelli, M., Reithinger, N.: In: [66], pp. 272–295
22. Aaron, A., Chen, S., Cohen, P., Dharanipragada, S., Eide, E., Franz, M., Leroux, J.M., Luo, X., Maison, B., Mangu, L., Mathes, T., Novak, M., Olsen, P., Picheny, M., Printz, H., Ramabhadran, B., Sakrajda, A., Saon, G., Tydlitat, B., Visweswariah, K., Yuk, D.: Speech recognition for DARPA Communicator. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 489–492 (2001)
23. Sonntag, D., Reithinger, N.: SmartWeb Handheld Interaction: General Interactions and Result Display for User-System Multimodal Dialogue. In: Smartweb technical document, DFKI, Saarbruecken, Germany, vol. 5 (2007)
24. Reithinger, N., Alexandersson, J., Becker, T., Blocher, A., Engel, R., Löckelt, M., Müller, J., Pfleger, N., Poller, P., Streit, M., Tschernomas, V.: SmartKom: Adaptive and Flexible Multimodal Access to Multiple Applications. In: Proceedings of the 5th Int. Conf. on Multimodal Interfaces, pp. 101–108. ACM Press, Vancouver (2003)
25. Horvitz, E.: Uncertainty, action, and interaction: In pursuit of mixed-initiative computing. IEEE Intelligent Systems 14, 17–20 (1999)
26. van Rijsbergen, C.J.: Information Retrieval, 2nd edn. Butterworths, London (1979)
27. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)
28. Ingwersen, P.: Information Retrieval Interaction. Taylor Graham, London (1992)
29. Cai, Y.: In: [66], pp. 17–46
30. Singh, S., Litman, D., Kearns, M., Walker, M.A.: Optimizing Dialogue Management with Reinforcement Learning: Experiments with the NJFun System. Journal of Artificial Intelligence Research (JAIR) 16, 105–133 (2002)
31. Jameson, A.: Adaptive interfaces and agents. In: The human-computer interaction handbook: fundamentals, evolving technologies and emerging applications, pp. 305–330. Lawrence Erlbaum Associates, Inc., Mahwah (2003)
32. Paek, T., Chickering, D.: The markov assumption in spoken dialogue management. In: Proceedings of the 6th SigDial Workshop on Discourse and Dialogue, Lisbon, Portugal (2005)
33. Oviatt, S., Swindells, C., Arthur, A.: Implicit user-adaptive system engagement in speech and pen interfaces. In: CHI 2008: Proceeding of the Twenty-Sixth Annual SIGCHI Conference on Human Factors in Computing Systems, pp. 969–978. ACM, New York (2008)
34. Sonntag, D.: Ontologies and Adaptivity in Dialogue for Question Answering. AKA Press and IOS Press (January 2010)
35. Strachan, L., Anderson, J., Evans, M.: Pragmatic user modelling in a commercial software system. In: Proceedings of the 6th International Conference on User Modeling, pp. 189–200. Springer, Heidelberg (1997)
36. Forbes-Riley, K., Litman, D.: A user modeling-based performance analysis of a wizarded uncertainty-adaptive dialogue system corpus. In: Proceedings of Interspeech (2009)
37. Zukerman, I., Litman, D.J.: Natural language processing and user modeling: Synergies and limitations. User Model. User-Adapt. Interact. 11(1-2), 129–158 (2001)
38. Kaizer, S., Bunt, H.: Multidimensional dialogue management. In: Proceedings of the 7th SigDial Workshop on Discourse and Dialogue, Sydney, Australia (July 2006)
39. Brown, P., Levinson, S.C.: Politeness: Some Universals in Language Usage (Studies in Interactional Sociolinguistics). Cambridge University Press, Cambridge (1987)
40. Gruber, T.R.: Towards Principles for the Design of Ontologies Used for Knowledge Sharing. In: Guarino, N., Poli, R. (eds.) Formal Ontology in Conceptual Analysis and Knowledge Representation. Kluwer Academic Publishers, The Netherlands (1993)
41. Oviatt, S., MacEachern, M., Levow, G.A.: Predicting hyperarticulate speech during human-computer error resolution. Speech Commun. 24(2), 87–110 (1998)
42. Paek, T., Horvitz, E., Ringger, E.: Continuous Listening for Unconstrained Spoken Dialog. In: Proceedings of the 6th International Conference on Spoken Language Processing, ICSLP 2000 (2000)
43. Gebhard, P.: Emotionalisierung interaktiver Virtueller Charaktere - Ein mehrschichtiges Computermodell zur Erzeugung und Simulation von Gefühlen in Echtzeit. PhD thesis, Saarland University (2007)
44. Ruckmick, C.A.: The Psychology of Feeling and Emotion. McGraw-Hill, New York (1936)
45. Morris, W.N.: Mood: The Frame of Mind. Springer, New York (1989)
46. Reithinger, N., Gebhard, P., Löckelt, M., Ndiaye, A., Pfleger, N., Klesen, M.: Virtualhuman: dialogic and affective interaction with virtual characters. In: ICMI 2006: Proceedings of the 8th International Conference on Multimodal Interfaces, pp. 51–58. ACM, New York (2006)
47. André, E., Dybkjær, L., Minker, W., Heisterkamp, P. (eds.): ADS 2004. LNCS (LNAI), vol. 3068. Springer, Heidelberg (2004)
48. Mehrabian, A.: Outline of a general emotion-based theory of temperament. In: Explorations in Temperament: International Perspectives on Theory and Measurement, pp. 75–86. Plenum, New York (1991)
49. Ortony, A., Clore, G.L., Collins, A.: The Cognitive Structure of Emotions. Cambridge University Press, Cambridge (1988)
50. McCrae, R., John, O.: An introduction to the five-factor model and its applications. Journal of Personality 60, 175–215 (1992)
51. Minsky, M.: The Emotion Machine: Commonsense Thinking, Artificial Intelligence, and the Future of the Human Mind. Simon & Schuster, New York (November 2006)
52. Tosa, N., Nakatsu, R.: Life-like communication agent - emotion sensing character “mic” & feeling session character “muse”. In: ICMCS 1996: Proceedings of the 1996 International Conference on Multimedia Computing and Systems (ICMCS 1996). IEEE Computer Society, Washington, DC (1996)
53. Batliner, A.: Whence and whither: The automatic recognition of emotions in speech (invited keynote). In: André, E., Dybkjær, L., Minker, W., Neumann, H., Pieraccini, R., Weber, M. (eds.) PIT 2008. LNCS (LNAI), vol. 5078, pp. 1–1. Springer, Heidelberg (2008)
54. Rotaru, M., Litman, D.J., Forbes-Riley, K.: Interactions between speech recognition problems and user emotions. In: Proceedings of Interspeech 2005 (2005)
55. Litman, D.J., Moore, J.D., Dzikovska, M., Farrow, E.: Using natural language processing to analyze tutorial dialogue corpora across domains and modalities. In: [67], pp. 149–156
56. van Turnhout, K., Terken, J., Bakx, I., Eggen, B.: Identifying the intended addressee in mixed human-human and human-computer interaction from non-verbal features. In: ICMI 2005: Proceedings of the 7th International Conference on Multimodal Interfaces, pp. 175–182. ACM, New York (2005)
57. Mehrabian, A.: Nonverbal Communication. Aldine-Atherton, Chicago, Illinois (1972)
58. Valstar, M., Pantic, M.: Fully automatic facial action unit detection and temporal analysis. In: CVPRW 2006: Proceedings of the 2006 Conference on Computer Vision and Pattern Recognition Workshop, vol. 149. IEEE Computer Society, Washington, DC (2006)
59. Tao, H., Huang, T.S.: Connected vibrations: A modal analysis approach for nonrigid motion tracking. In: CVPR 1998: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 735. IEEE Computer Society, Washington, DC (1998)
60. Keltner, D., Ekman, P.: Facial Expression of Emotion. In: Handbook of Emotions, pp. 236–249. The Guilford Press, New York (2000)
61. Forbes-Riley, K., Litman, D.J.: Adapting to student uncertainty improves tutoring dialogues. In: [67], pp. 33–40
62. Nelson, T.O., Narens, L.: Metamemory: A theoretical framework and new findings. In: Bower, G.H. (ed.) The Psychology of Learning and Motivation: Advances in Research and Theory, vol. 26, pp. 125–169. Academic Press, London (1990)
63. Sonntag, D.: On introspection, metacognitive control and augmented data mining live cycles. CoRR abs/0807.4417 (2008)
64. Chu, S.W., O'Neill, I., Hanna, P., McTear, M.: An approach to multi-strategy dialogue management. In: Proceedings of INTERSPEECH, Lisbon, Portugal, pp. 865–868 (2005)
65. Studer, R., Ankolekar, A., Hitzler, P., Sure, Y.: A Semantic Future for AI. IEEE Intelligent Systems 21(4), 8–9 (2006)
66. Huang, T.S., Nijholt, A., Pantic, M., Pentland, A. (eds.): ICMI/IJCAI Workshops 2007. LNCS (LNAI), vol. 4451. Springer, Heidelberg (2007)
67. Dimitrova, V., Mizoguchi, R., du Boulay, B., Graesser, A.C. (eds.): Artificial Intelligence in Education: Building Learning Systems that Care: From Knowledge Representation to Affective Modelling. Proceedings of the 14th International Conference on Artificial Intelligence in Education, AIED 2009, Brighton, UK, July 6–10. Frontiers in Artificial Intelligence and Applications, vol. 200. IOS Press, Amsterdam (2009)
Human Performance in Virtual Environments
Yvonne R. Masakowski and Steven K. Aguiar
Naval Undersea Warfare Center, Newport, RI
Abstract. There are significant advances in virtual world technologies that permit collaboration in a distributed, virtual environment. In the real world environment, distributed teams collaborate via face-to-face communication, using social interactions, such as eye contact and gestures, which provide critical feedback to the human decision maker. The virtual world affords the system designer the ability to evaluate an operator’s ability to respond to information (e.g., events, goals, objectives, etc.) in a complex, distributed team environment. The question is, how do we evaluate human performance and cognitive processes of decision makers within the virtual environment? We expect that virtual environments should facilitate sharing ideas, information and strategies among team members to achieve situation awareness and effective decision-making. This paper will discuss ways to evaluate performance in virtual environments and the critical role that immersive 3D environments will play in future ship designs.
1 Introduction
Virtual world technologies have begun to present opportunities for the development of collaborative design environments and to facilitate the user’s ability to explore numerous applications, such as training and system design. Virtual immersive environments also provide a unique means of developing rapid prototypes to evaluate system and operator performance during the early stages of design. For the US Navy, virtual technologies provide a tool for exploring new system designs that will reduce total ownership costs during the platform design process, as well as a means of examining the impact of reduced manning on the platform. Reducing total ownership costs for future platform designs is a critical driver, but it should not be pursued at the cost of system performance and/or human safety. System and human performance are intricately linked and must be evaluated together throughout the design process, especially at its early stages: re-engineering inadequate system designs that are already integrated into the platform is far too costly. Virtual world technologies therefore provide a unique means of assessing performance metrics, which facilitates modifications and reconfiguration throughout the design process. One of the principal advantages of virtual world systems is that they connect distributed and disparate networks and enable teams of individuals to communicate and work together within a virtual workspace. Virtual technology affords the system designer unique advantages in terms of design and
team productivity. However, there are costs and trade-offs related to human performance that must be considered.
2 Trade-offs in Performance: Challenges and Constraints
The flexibility of the virtual environment affords users the opportunity to explore their environment in a unique manner. However, virtual exploration differs greatly from human exploration in the real world. From a biological perspective, humans are hard-wired for processing sensory input. Multisensory experience affords humans an opportunity to evaluate their environment and filter vast amounts of information as they move through it. Humans adapt to changes in luminance, shifts in auditory thresholds, and tactile and kinesthetic cues. Each of these sensory inputs affords proprioceptive feedback, which is essential for human performance, safety and survival. Humans adapt to changes in the environment by using their perceptual and cognitive capacities. For example, the human visual system adapts to threshold changes in luminance, while the auditory system ensures our ability to detect signals above threshold (e.g., speech versus whispering) and to move toward or away from a specific sound, depending upon its classification. Thus, the real world environment is replete with information processed by humans’ multisensory processing system. Our ability to move about in the real world environment, albeit complex, is supported by our nervous system’s ability to process multiple channels of information from various sensory modalities. In contrast, the virtual world lacks such proprioceptive feedback, which occurs during the normal course of human interactions with the environment.
Fig. 1. Submarine Control Center in a Virtual World
Humans navigate in the real world and gain feedback via their neurological system in synchrony with their movements and behaviors. For example, when riding a bicycle, one adjusts posture to accommodate and maintain balance; or, when moving from bright lights to a dimly lit room, the individual’s visual system accommodates and adjusts to changes in luminance thresholds [28]. Proprioceptive feedback is critical to our understanding of our place in the environment and how we move about in the real world. In contrast, the virtual world is one in which humans manipulate Avatars as a means of moving about in the synthetic, immersive environment. Moving or flying an Avatar is a cognitive and motoric learning experience rather than a biological, sensory experience. The lack of proprioceptive feedback is a gap in providing input to our multisensory pathways. The virtual environment is composed of simulated displays, Avatars and systems, while the real operator remains unseen. Therefore, the operator must learn to navigate and interact with information and Avatars in the simulated environment. This environment presents a challenge both perceptually and cognitively. Cognitive theories posit the utilization, storage and retrieval of information [29]. Workload theories suggest that there are several different resource pools, each limited in nature and responsible for a different kind of processing. Wickens’ multiple-resource theory is among the most widely accepted of these. Within this framework, differences are defined by how information is presented and how it is perceived by the individual. This breaks down into two dimensions: modality and processing code. Modality refers to the physical method of perception, either visual or auditory. In addition, information can be coded either spatially or verbally. This creates a situation in which there are four types of perception/encoding:
• Visual Modality/Spatial Code (e.g., a graph)
• Visual Modality/Verbal Code (e.g., text)
• Auditory Modality/Spatial Code (e.g., sound localization)
• Auditory Modality/Verbal Code (e.g., speech)
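As an illustration only (this code is not part of the chapter), the 2x2 taxonomy can be written down directly; the competes heuristic below is an assumed simplification of multiple-resource theory under which two concurrent tasks interfere most when they share both modality and processing code:

from dataclasses import dataclass
from enum import Enum

class Modality(Enum):
    VISUAL = "visual"
    AUDITORY = "auditory"

class Code(Enum):
    SPATIAL = "spatial"  # e.g., a graph or sound localization
    VERBAL = "verbal"    # e.g., text or speech

@dataclass(frozen=True)
class Channel:
    modality: Modality
    code: Code

def competes(a: Channel, b: Channel) -> bool:
    """Assumed heuristic: interference is highest when modality and code both match."""
    return a.modality == b.modality and a.code == b.code

reading_text = Channel(Modality.VISUAL, Code.VERBAL)
radar_display = Channel(Modality.VISUAL, Code.SPATIAL)
print(competes(reading_text, radar_display))  # False: same modality, different code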
In order to evaluate work overload, one must evaluate the subject's ability to meet task performance and time constraints. Task performance is impacted by the level of workload and one’s ability to manage information within constraints (e.g. time constraints, resource constraints, etc.). Thus, work overload may negatively impact one’s comprehension and overall performance. An example of the impact of constraints may be seen by the way in which one learns to manipulate movements of the Avatar in the virtual world. One struggles to achieve smooth movements that will facilitate task completion. The virtual world operator must remain cognizant of the skills required to move the Avatar and perform the tasks required to achieve and maintain situational awareness. The future battle space environment is both complex and distributed; thus, there is a need to employ collaborative technologies, such as virtual environments, to evaluate the most effective means of team performance. Team communication impacts workload, situational awareness and performance [30]. This study highlighted the critical role of communication among team members in a simulated C2 environment. Therefore, virtual environments provide a means for evaluating various configurations, tasks and workloads as they are distributed to each team member.
Indeed, proponents of net-centric warfare [41] contend that the integration of distributed teams is a critical component of command and control (C2) environments and is essential for achieving situational awareness. Performance in the virtual environment challenges both the individual and the team’s approach to achieve a level of situational awareness and accomplish tasks. Performance assessment in the virtual environment also raises questions regarding the validity of the measures garnered during operator task performance. One approach to addressing these differences is to evaluate human performance in a comparative study which would enable the investigation of human performance in both real and synthetic environments. The study would thereby facilitate an understanding of the relationship(s) of humans performing tasks with the virtual world environment. In contrast to traditional system design, the virtual world environment provides a unique means of re-configuring workspaces, rapid prototyping and examining operator performance. Figure 1 illustrates a prototype of a submarine control center in a virtual world environment. This virtual representation provides designers a means of evaluating various configurations for the combat command and control room. We recently ran a pilot study in which we evaluated human performance in both real and virtual world environments. We utilized a standard Navy scenario for operators to evaluate their ability to reconfigure the interface to meet their task requirements and we conducted a brief torpedo firing exercise. Although this was only a pilot study, preliminary results highlight the ease with which experience in the virtual world environment facilitated the operators’ performance in the real world environment. Specifically, operators performed equally well in both environments and reported that experience in the virtual environment primed their ability to focus on the task at hand more effectively than otherwise. More interesting was the fact that there were critical differences between novice and expert operators’ performances. Novices gained more insight by using the virtual environment first, to guide their task performance during the real world experiment. Designers must be aware of the challenges presented by the virtual environment in terms of perceptual expectations of the user. There are critical differences among users and their level of knowledge, expertise and experience. For example, the novice and expert end user have a different set of expectations that must be taken into consideration during the early stages of design. Thus, the system designer must consider and accommodate system designs, layouts and display configurability with the end user in mind. The ability of the user to shape their information display according to their specific role and requirements is one of the critical components of display designs. Re-configurability affords the end user a means of managing and manipulating information as they require. However, most often, designers fail to address this issue and lack a coherent approach to interface design, which is essential to supporting the end user. Several studies have shown that virtual world environments present challenges with regard to usability [31-32]. Users must learn ways to compensate for the lack of natural interaction with the information in the environment and to manipulate information embedded in display systems. 
Attention management is often impeded by the operator’s need to move the Avatar to the information as opposed to merely processing information presented in front of them. Thus, research has demonstrated the
importance of focusing on the usability of the interface and its dimensions during the early stages of product design. In addition, it is critical to evaluate not only the individual’s approach to interacting with the displays in the environment but also the ways in which teams collaborate and share information within this setting. The ergonomics of virtual collaboration highlight the need to attend to processes for sharing information as well as validating and developing trust to achieve successful task completion [33]. Collaborative virtual environments provide a means of evaluating the capabilities and affordances provided by this technology among distributed teams. Specifically, virtual environments are a powerful collaborative tool which facilitates a team’s ability to successfully design systems in a distributed manner [14]. Reconfigurations of system design layouts may be done in real time among team members. Operators can interact with each other and take control of real tactical functions provided elsewhere in the network. The virtual environment supports the development of distributed system design teams who can share their network of tools, designs and analyses among all members. However, one of the principal challenges remaining in the virtual environment is that of communication within the environment. Although the virtual environment affords social interaction using chat, visual interaction and gestures, the quality of interaction is impeded by the lack of face-to-face contact. In the real world, individuals share information within and among team members verbally and nonverbally. Virtual network environments are highly decentralized with distributed personnel in the network. Whereas gestural and voice data may be captured, subtle forms of communication such as eye gaze movement during communication are unavailable in the virtual environment. Eye-to-eye contact facilitates the development of team interactions, trust and coordinated activities, and support the development of working as a team/unit. A cohesive team is essential to achieve mission success. Team members must achieve shared understanding and awareness garnered from each other, the data, the operational context and the courses of action available within the conditions presented to them. In real time, team members communicate and validate by sharing information and seeking support via eye-to-eye contact, which serves as a means of attaining trust and validation within groups. This human nonverbal contact is an essential component for building one’s mental schema and situational awareness. It has been well documented that eye tracking provides an important means of evaluating human social interaction. Humans can infer intent, interest, fear and/or unease from an individual’s gaze.
3 Cognitive Research and Measures: Attention Management
One of the key questions is how do we evaluate trust and the accuracy of information/knowledge among virtual team members? In a real world environment, distributed teams collaborate using a variety of tools to facilitate the development of trust among participants by providing context for each participant’s contribution and its impact on decision making. Information may be dynamically presented during each
interaction, which provides a means of evaluating the validity of each person’s contribution toward achieving a common understanding of the situation. However, there is a risk associated with presenting too much information and presenting it in a random, chaotic manner, which would, at minimum, confuse and obscure valid information. Thus, a system which provides time-critical information among decision makers in a clear and concise manner can facilitate collaboration among partners, while reducing overall workload among team members and enhancing situational awareness [34]. Endsley [34] described Situational Awareness (SA) in terms of three levels: perception, comprehension, and projection. Specifically, situational awareness is the result of achieving a comprehensive understanding of the battle space within an operational context, which enables us to make effective and accurate decisions. In the military domain, SA is further forged by awareness and understanding of an adversary’s knowledge and capabilities. One of the key components in the acquisition of SA and its implementation is the cognitive process used to evaluate information and make a decision within a dynamic environment.
Fig. 2. Components of Human Performance in Virtual Environments [38]
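The three levels can also be made operational in measurement. The following is a hypothetical sketch of how SA probes might be tagged with Endsley's levels and scored per level; the probe structure and scoring are invented for illustration and are not the instruments used in the study discussed here:

from dataclasses import dataclass
from enum import IntEnum

class SALevel(IntEnum):
    PERCEPTION = 1     # What are the elements in the environment?
    COMPREHENSION = 2  # What do they mean in the current context?
    PROJECTION = 3     # What are they likely to do next?

@dataclass
class SAProbe:
    level: SALevel
    question: str
    correct: bool

def sa_scores(probes: list[SAProbe]) -> dict[SALevel, float]:
    """Proportion of correct responses per SA level."""
    scores = {}
    for level in SALevel:
        answered = [p for p in probes if p.level == level]
        scores[level] = (sum(p.correct for p in answered) / len(answered)) if answered else 0.0
    return scores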
It is essential to ensure that virtual teams have the opportunity to evaluate information for its accuracy and reliability. Virtual teams can be somewhat defended from the chaos of the environment given their physical distances from each other and the scene itself. Rather, they form their judgments in the absence of physical and contextual cues. As a result, collaborators in the virtual team setting must generate context via information presented to them in the virtual world and augment their SA, as they lack the normal, physical boundaries of the real world environment [35]. It is therefore important for virtual team members to establish a common, shared SA via graduated levels of information access, which will enable virtual teams to collaborate effectively [36-37]. The virtual network facilitates collaboration and information
exchange within the synthetic environment. However, it also raises questions regarding how one might evaluate the effectiveness of the information and its contribution to effective decision-making among team members. A critical first step toward evaluating human performance in a virtual environment is to establish a valid benchmark of performance in the real world. To that end, we recently conducted an experiment wherein we compared human performance in the real environment with that in the virtual world. Due to the complexity of the virtual environment, we developed a set of effective measures to evaluate human performance in the virtual environment. Among these, we evaluated the operator’s ability to interact with the displays and navigate within the virtual environment. We examined the role that communication plays in the validation of information and decision making. We examined team member interactions and processes related to the validation of data/information. We compared operator performance in the real world environment with that of the operator’s performance in the virtual world environment. We evaluated trust and validation metrics as part of the study in our comparison of real world versus virtual world environments and the means used to attain that trust in each setting. We anticipate that the results of this study would afford us with performance metrics which could be used to develop a virtual toolkit to evaluate complex system designs in the future. Virtual technology afforded operators the means to reconfigure the information on their displays. This study focused on a contact management task with both novice and expert operators. We used the Common Observation Recording tool (CORT) to perform an objective task analysis. CORT is a Java-based tool for observational data collection and analysis. Its simple design creates an intuitive user interface for tap screen recording which enables observers to record observations and timestamps. This can be synchronized with other data collection tools. Metrics include assessments of Situational Awareness [34], decision accuracy and confidence levels. This tool has proven to be very effective for capturing meaningful performance metrics while capturing system-based data, such as timing of events and accuracy of solutions. The observation provides a means of capturing an operator’s patterns of behavior as well as team interactions. Self report data provides further information on cognitive factors, such as workload and interface usability. Thus, multiple methods of data collection are essential for achieving an understanding of the user’s interaction with the environment. Cognitive task analyses were conducted and direct observational data using the Common Observation Recording Tool (CORT) was used to record operator interactions with the data. Confidence levels were collected periodically during the study as well. The results of our study indicated that operators performed equally well in both environments. Of interest is the fact that novice operators performed as well as experts when the virtual world study was conducted prior to the real time study and that confidence levels were equivalent for both groups during this session. We found that communication among team members played a critical role in optimizing their performance in the virtual world study. 
The results of this study provide evidence that virtual world technologies can provide critical experience in learning new processes and provide opportunities for building trust via communication in the absence of a physical partner [15]. We are continuing our exploration of the potential for training and system design in collaborative virtual environments.
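Because CORT is a Java-based tool whose internal data model is not given here, the following sketch only illustrates, under assumed names, the kind of time-stamped observation record such a tool produces and how it could later be aligned with system-logged events from the simulation:

import time
from dataclasses import dataclass, field

@dataclass
class Observation:
    observer: str
    operator: str
    event: str  # e.g., "reconfigured display", "requested bearing", "queried teammate"
    timestamp: float = field(default_factory=time.time)

class ObservationLog:
    def __init__(self) -> None:
        self.records: list[Observation] = []

    def record(self, observer: str, operator: str, event: str) -> Observation:
        obs = Observation(observer, operator, event)
        self.records.append(obs)
        return obs

    def between(self, start: float, end: float) -> list[Observation]:
        """Select observations inside a time window, e.g., around a system-logged event."""
        return [o for o in self.records if start <= o.timestamp <= end]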
We know that dynamic and unexpected events are defining characteristics of numerous military scenarios and operations. The virtual environment provides a unique approach to evaluate operator performance and decision-making under conditions of uncertainty [39]. The operational level of the decision maker is critical to their ability to respond to dynamic situational context information (e.g., events, goals, objectives, etc.) in a complex distributed team environment. Furthermore, the virtual environment also provides a means for designers to evaluate the impact of synthetic environments on human performance by comparing different approaches to training, i.e. traditional vs. virtual environment activity. There is a need for individuals to deconflict information and provide shared situational awareness among virtual team members. To effectively address these challenges, decision-makers require tools to achieve situational awareness and decision support tools to facilitate making rapid, robust decisions. Individuals and virtual teams that operate within a netted environment represent an innovative approach to information management and technical operations as a whole. Research is being conducted at NUWC [18] to investigate and compare technical results delivered using virtual, immersive worlds vs. classical environments. Training and education are obvious applications of this technology. Indeed, numerous universities have already engaged in this technology for distributed classrooms around the globe (e.g. University of Florida, Digital Worlds Environment). Remote classrooms enable students and instructors to meet in virtual environments and share experience, knowledge and expertise. Training has taken on a new level by enabling participation of individuals in scenario simulation based training that builds their skill levels in a controlled setting. You can also develop immersive learning curricula which allow students to immerse themselves in an artificial information space design and turn an otherwise passive learning session into an interactive experience. For example, NUWC is developing an immersive learning tool to teach the fundamentals of target motion analysis, from sensor detection to situational awareness. Students can walk into a 3D parameter evaluation plot and understand that the data represents the level of uncertainty [40]. Masakowski et al. will continue studies to evaluate designer performance using knowledge audits, interviews and cognitive task analyses (CTA) [35] as a means of establishing how individuals perform in each of the different environments, i.e. traditional and the immersive, virtual environment. The research team also uses comparative cognitive task analysis (C2TA) to address differences in each of the training methodologies. Comparative studies are well documented and provide a means of employing traditional experimental design to investigate which method or approach is more efficient and/or effective [12]. The ability to monitor and measure human performance and interactions with systems in the virtual environment and with multiple sources of data is the real advantage of the virtual world environment. We anticipate that the results of this study will probe cognitive processes of decision makers and provide insights with regards to their strategies. Additionally, we anticipate insights regarding the impact of their level of knowledge and expertise working in both the real world and virtual world environment. 
Further, we expect that the virtual environment should provide a rich forum for sharing information and strategies as well as forge an integrated, collaborative team.
Second Life ™ [42] technologies afford the decision maker the opportunity to synthesize, integrate and modify the presentation of information and data in such a way that will simplify information presentation. The most current system provides a means of interacting with data in a PC–based interface environment where operators gather and display information. Today’s Second Life systems provide an adaptive means of presenting information to the decision maker in a manner that affords them the ability to manipulate data to quickly correlate and display information to all concerned. This ability to interact with data facilitates situational awareness and understanding of the environment and accelerates the decision cycle. The decision-maker of the 21st century needs to acquire an intuitive understanding of his or her environment. Second Life systems will enhance an operator’s ability to achieve SA by integrating information from multi-modalities and construct a mental model of the components based on whatever data is available. In this manner, SA is realized as a dynamic system that supports an individual’s rapidly changing goals and decision-making requirements within the operational environment. However, given the complexity of the environment, the human is taxed to meet the attention requirements of their decision-making space. Therefore, it is essential to evaluate the impact of Second Life systems on an individual’s cognitive processes as a means of evaluating the effectiveness of this technological approach. There is also a need to conduct comparative cognitive analyses to evaluate the impact of this mode of information presentation on teams of individuals who will share the information and database to formulate their decisions. Although the human may be challenged by the complexities in an operational environment, the human brain has an untapped capacity that can be extended. To this end, tools and technologies that can augment our cognitive capacity enable us to achieve a higher level of understanding that extends beyond the primed pattern recognition model by integrating the expertise of the decision-maker with their mental model of the world. The integration and partnering of the human with technologies such as intelligent agent architectures and robotics, as well as future direct brain interfaces, will elevate the decision-maker to the level of “Cognitive Command”. Second Life Systems will provide a unique synthetic environment for the operator to garner information in an intuitive manner. Cognitive processing in the synthetic environment will require integrating information, knowledge, experience and the expertise of the individual decision-maker to facilitate the most effective and accurate decisions. Given the emergence of synthetic environments, it is timely to evaluate the effectiveness of these systems in training and performance of individuals who will be working within highly complex systems and environments in the future.
4 Conclusions
In conclusion, we contend that it is critical to use virtual world technologies as a means of exploring various configurations and system designs and for assessing human performance. Further, there is a need to identify the optimal system design that would afford enhanced operator performance. Virtual world technologies are ideal for evaluating collaborative team performance in a distributed environment. Performance differences may be minimized using various configurations in the environment
layout and design, and they may be developed to optimize shared task performance. Advances in virtual world technologies will also support future military operations within the distributed environment as we seek to accelerate the decision cycle. Future research is warranted in this area, as there is an opportunity to shape system designs and support team performance in virtual environments.
References
1. Anderson, J.R.: Personal communication (2008)
2. Anderson, J.R., Bothell, D.: An Integrated Theory of the Mind. Psychological Review 111(4), 1036–1060 (2004)
3. Anderson, J.R., Schooler, L.J.: Reflections of the environment in memory. Psychological Science 2, 396–408 (1991)
4. Anderson, J.R.: The Architecture of Cognition. Harvard University Press, Cambridge (1983)
5. Baumann, M.R., Sniezek, J.A., Buerkle, C.A., Salas, E., Klein, G.: Self-evaluation, stress, and performance: A model of decision making under acute stress. In: Linking Expertise and Naturalistic Decision Making. Lawrence Erlbaum Associates Publishers, Mahwah (2001)
6. Bellenkes, A.H., Wickens, C.D., Kramer, A.F.: Visual scanning and pilot expertise: The role of attention flexibility and mental model development. Aviation, Space, and Environmental Medicine 68, 569–579 (1997)
7. Cai, Y.: How Many Pixels Do We Need to See Things? In: Proceedings of International Conference of Computational Science, Australia, pp. 1064–1073 (2003)
8. Cai, Y.: Minimalism Context-Aware Displays. Journal of CyberPsychology and Behavior 7(6), 635–644 (2004)
9. Cook, M., Noyes, J., Masakowski, Y.R.: Decision Making in Complex Environments. Ashgate Publishing, UK (2007)
10. Hanisch, K.A., Kramer, A.F., Hulin, C.L.: Cognitive representations, control, and understanding of complex systems: A field study focusing on components of users’ mental models and expert/novice differences. Ergonomics 34, 1129–1148 (1991)
11. Kiyokawa, K., Takemura, H., Yokoya, N.: SeamlessDesign for 3D Object Creation. IEEE MultiMedia 7(1), 22–33 (2000)
12. Kirschenbaum, S., Trafton, J., Pratt, E.: Comparative Cognitive Task Analysis. In: Proc. Human Factors and Ergonomics Society Annual Meeting: Cognitive Engineering and Decision Making, vol. 5, pp. 473–477 (2003)
13. Klein, G., Militello, L.: The Knowledge Audit as a Method for Cognitive Task Analysis. In: Frieman, H. (ed.) Proc. 5th Conference on Naturalistic Decision Making: How Professionals Make Decisions, Stockholm, Sweden (2000)
14. Kolarevic, B., Schmitt, G., Hirschberg, U., Kurmann, D.: An Experiment in Design Collaboration. Automation in Construction 9, 73–81 (2000)
15. Masakowski, Y.R., Maxwell, D., Aguiar, S.: The Impact of Synthetic Virtual Environments on Combat System Design and Operator Performance. In: Proceedings of the 2009 Joint Undersea Warfare Technology Fall Conference. Undersea Warfare: Full Spectrum Capabilities to Preserve Freedom of the Seas (2009)
16. Masakowski, Y.R.: Cognition-Centric Systems Design: A Paradigm Shift in System Design. In: Proceedings of the COMPIT 2008 Conference, Liege, Belgium, pp. 603–607 (2008)
17. Masakowski, Y.R., Aguiar, S.: The Impact of Synthetic Virtual Environments on Combat System Design and Operator Performance. In: NDIA Joint Undersea Warfare Technology Fall Conference, “Undersea Warfare: Full Spectrum Capabilities to Preserve Freedom of the Seas”, Groton, CT (2009)
18. Masakowski, Y.R.: Human Factors for Decision Making and Battlespace Superiority. In: NATO Undersea Research Center (NURC) Workshop on Data Fusion and Anomaly Detection for Maritime Situational Awareness, La Spezia, Italy (2009)
19. Masakowski, Y.R.: The Evaluation of Human Performance in Virtual Environments. In: International Test & Evaluation Association: Advance Technology Conference, Palo Alto, CA (2009)
20. Masakowski, Y.R.: Trade-offs in Performance: Autonomous Unmanned Systems and Their Impact on Human Performance. PROVEN HSI Journal 2(3) (July 2009)
21. Masakowski, Y.R.: The Challenges of Virtual World Technologies and Their Impact on Human Performance. In: Instinctive Computing Workshop, Carnegie Mellon (2009)
22. Pavel, M., Wang, G., Kehai, L.: Augmented Cognition: Allocation of Attention. In: Proceedings of the 36th Hawaii International Conference on System Sciences. IEEE Computer Society, Los Alamitos (2002)
23. Rentsch, J.R., McNeese, M.D., Perusich, K.: Modeling, Measuring, and Mediating Teamwork: The Use of Fuzzy Cognitive Maps and Team Member Schema Similarity to Enhance BMC3I Decision Making. In: IEEE, pp. 1081–1086 (2000)
24. Shaw, M.L.: A capacity allocation model for reaction time. Journal of Experimental Psychology: Human Perception & Performance 4, 586–598 (1978)
25. Shaw, J.L., Shaw, P.: Optimal allocation of cognitive resources to spatial locations. Journal of Experimental Psychology: Human Perception and Performance 3, 201–211 (1977)
26. Thomas, J.J., Cook, K.A. (eds.): Illuminating the Path: The Research and Development Agenda for Visual Analytics. IEEE Computer Society Press, Los Alamitos (2005)
27. Trika, S., Bannerjee, P., Kashyap, R.L.: Virtual Reality Interfaces for Feature Based Computer-Aided Design Systems. Computer Aided Design 29(8) (1997)
28. Chisholm, H. (ed.): Weber’s Law. Encyclopædia Britannica, 11th edn. Cambridge University Press, Cambridge (1911)
29. Wickens, C.: Multiple Resources and Mental Workload, http://www.ise.ncsu.edu/nsf_itr/794B/papers/Wickens_2008_HF_MRT.pdf
30. Funke, G., Galster, S.: The effects of spatial processing load and collaboration technology on team performance in a simulated C2 environment. In: Proceedings of ECCE 2007 (2007)
31. Chu, K.: Genetic Space. A.D: Architects in Cyberspace II 68(11-12) (1998)
32. Chu, C.C., Gadh, R.: A Quantitative Analysis on Virtual Reality-Based Computer Aided Design System Interfaces. In: Proceedings of ASME 2002 International Mechanical Engineering Congress and Exposition (IMECE 2002), New Orleans, USA, November 17-22 (2002)
33. Gill, S.A., Ruddle, R.A.: Using virtual humans to solve real ergonomic design problems. In: Proceedings of the IEE International Conference on Innovation Through Simulation (SIMULATION 1998), pp. 223–229 (1998)
34. Endsley, M.R., Bolte, B., Jones, D.G.: Designing for situation awareness: An approach to human-centered design. Taylor & Francis, London (2003)
35. Lipnack, J., Stamps, J.: Virtual Teams: People Working Across Boundaries with Technology, 2nd edn. John Wiley & Sons, New York (2000)
36. Rickel, J., Johnson, W.L.: Task-Oriented Collaboration with Embodied Agents in Virtual Worlds. In: Cassell, J., Sullivan, J., Prevost, S. (eds.) Embodied Conversational Agents. MIT Press, Cambridge (2000)
37. Sonnenwald, D.H., Pierce, L.G.: Information behavior in dynamic group work contexts: interwoven situational awareness, dense social networks and contested collaboration in command and control. IPM 36, 461–479 (2000)
38. Stanney, K.M., Mourant, R., Kennedy, R.S.: Human factors issues in virtual environments: A review of literature. Presence: Teleoperators and Virtual Environments 7(4), 327–351 (1998)
39. Hockey, G., Sauer, J., Wastell, D.: Adaptability of Training in Simulated Process Control: Comparison of Knowledge- and Rule-based Guidance under Task Changes and Environmental Stress. Human Factors 49, 158–174 (2007)
40. http://www.teamorlando.org/gametech/downloads/2009/presentations/Panel_Aguiar_Military_Applications_of_Virtual_Worlds.pdf
41. Alberts, D., Garstka, J., Stein, F.: Network Centric Warfare: Developing and Leveraging Information Superiority. DoD C4ISR Cooperative Research Program (2000)
42. http://secondlife.com
Exploitational Interaction
Manuel García–Herranz, Xavier Alamán, and Pablo A. Haya
AmILab, Ambient Intelligence Laboratory, E.P.S., Universidad Autónoma de Madrid
[email protected]
Abstract. Exploitational Interaction is an accessibility and control paradigm that allows individuals to make full use of the technologies around them, to exploit their combinations and possibilities, and to adapt and shape their surroundings to their benefit. Focusing on programming personal environments as a pressing problem for the future home, and facing the challenge presented by the diversity of elements, users, needs and skills, this paper proposes a new way of designing programming systems that balance control and accessibility according to the user’s needs and skills. Articulated around simplification, modularization and reutilization, we present a rule–based language and multi–agent programming system for Exploitational Interaction, analyzing the most significant experiences and results of this five-year project.
1 Introduction
Bruscamente tierna aletea la paloma a ras de tierra sabiendo su final, y suena la piedra hecha timbal de sus alas.1
Ninety years old, she lives alone in her own house. Writing poetry, reading the newspaper, solving crosswords, checking the stock market and taking a nice variety of pills are part of her routine. She has recently been asking questions about the Internet. It seems to be everywhere, every medium speaks about it, and she is curious; she has always been, it is in her nature. The new iPad seemed a good tool to introduce her to the Internet, since she has never ever touched a computer (though she saw pictures on her grandson's last Christmas). It is important not to scare her off; she deserves to know the Internet (she has much to give too) and her interest is, most probably, a one–shot opportunity. After preparing an iPad with just three buttons on the screen (i.e., e-books, newspaper and a web browser), she is left with it for about a week. Two outstanding results. First, she keeps licking her index finger before turning the pages of the e-book. Second, she hangs out with her friends the next Thursday. For the first time in two years.
1 Brusquely tender flutters the dove in the ground knowing near the end, and echoes the stone turned of its wings into drum.
Two thoughts arise from these results:
– How profoundly it can change our lives to gain control over existing technologies, and how overwhelming the impact of providing natural tools to grasp this control can be.
– How deeply engraved in our mind and understanding are what we call natural concepts, and what this means for an embodied computer that assumes the history, records and personality of its new body. This, sometimes, means to be licked.
This experience points at a fundamental problem: Exploitation. Traditionally, Control and Accessibility have been associated, respectively, with complex processes and with collectives with strong special needs, yet both belong to common situations of everyday life. Setting an alarm clock, calling a friend, kissing someone or accessing the Internet are actions bounded by whether they are possible and whether we know how to accomplish them. In other words, what are the possibilities and how can we exploit them? The former refers to the available resources and the use they can provide. The latter, as in the initial experience with the iPad, refers to whether or not there are tools that allow our different capabilities to use them. A very significant scenario for this kind of problem is personal environments, for their diversity in elements, users and uses. Le Corbusier famously said in 1923, “a house is a machine for living in”. Now, in the computer era, the machine concept has evolved and the number of technologies present in the home has increased exponentially since then. In the last decades we have seen technology expand from water, gas and electricity to radios, phones, TVs, alarm clocks, air conditioning, heating systems, VCRs, irrigation systems, security alarms, automatic gates, electric blinds, PCs, mobile phones, TiVo, game stations, sensors, the Internet and many more. Revolutionizing the environment, these technologies provide many useful services but, on the other hand, they picture an increasingly complex home, very capable but difficult to use, maintain and adjust to the inhabitants’ preferences. In response to that, many projects have created environments that automatically adapt to their inhabitants’ behaviors and preferences using machine learning and artificial intelligence techniques such as Artificial Neural Networks (ANN) [1], Multi–agent systems (MAS) [2], Hidden Markov Models (HMM) [3], pattern mining [4], or Bayesian Networks [5][6]. These automatic systems focus on the inhabitants as creatures of habit [3][1], learn from what they see and try to anticipate what the user wants. Nevertheless, such approaches have their limitations. First of all, they require a reduction in the complexity of the data they learn from, so they focus on specific domains of automation [1][2], a single inhabitant [3] or a single situation or activity at a time [5][4]. Second, both learn-by-observation and learn-by-example approaches have the problems stated by Youngblood et al. [3] of users not being able to do all they want to automate and not wanting to automate all they do. These problems are particularly important in supportive spaces for people with special needs, in which the environment is supposed to do what the user simply cannot. Thus,
while they tackle the problem of accessibility, they do so by limiting the control power. Despite these limitations as a global solution for home automation, some of them are very well suited to deal with specific problems and therefore will be, most probably, part of the home of the future, picturing a smart house in which multiple technologies cohabitate in the environment to deal with different problems. That is, an enhanced version of today’s technologically heterogeneous homes. Thus, as we analyze today’s homes and today’s research for the coming ones, the idea of a future home in which every element and automation system is coherently designed by the same developer is simply unrealistic. Therefore, the problem of control in intelligent environments must face the integration and cooperation of an increasing number and diversity of devices and services as, in Newman et al.’s words, we cannot expect to have available to us specific applications that allow us to accomplish every conceivable combination of devices that we might wish [7]. In this way, many projects have pursued a way to allow programming and interconnecting the different elements of the house, rather than allowing it to autonomously learn from observation, imagining, as Taylor does, the environment’s intelligence to be about how clearly the environment’s workings are revealed to us, and how easily such workings can be harnessed [8]. Such projects include operating systems [9][10], development environments [11], frameworks [12] or toolkits [13], allowing professional developers to combine the different resources of the environment to the users’ benefit. Nevertheless, the possibility of control and its accessibility are different issues, in the same way that the existence of the Internet and grandma accessing it are two different issues. While these systems considerably increase the control possibilities, the accessibility of such control is still far from the user. We believe that it is by providing the means for people to exploit the possibilities of their surroundings, by providing the right interaction tools, the tools for an Exploitational Interaction, rather than by adding new possibilities, that we truly impact their lives. Exploitational Interaction is a holistic and synergic approach to harnessing the potentials of technology and its impact on people’s lives. It stresses control and accessibility by considering the heterogeneous set of existing technologies in an environment as a whole, and their control as a process that adapts to the user’s needs and skills. It is based on the assumption that allowing users to access, control and combine existing technologies leverages their benefits in profound ways, as it provides a direct way to address personal needs and a unique door for self–realization.
2 Control and Accessibility
In an environment rich in computational elements, control refers not only to being able to change the elements’ status but also to changing their behaviors and dependencies: how they react to different stimuli or how they depend on one another.
In this sense, while users are rarely going to create new systems or services, they can radically modify the output of the home by accessing, controlling and combining existing ones. This is still a programming problem: the problem of programming the environment as a single unit and, therefore, the problem of designing the right language and programming structures. Accessibility, on the other hand, refers to how this control fits the user's abilities and understanding. Thus, designing a system to allow the user to program the environment imposes different requirements than designing one for professional developers. In fact, different users will have different skills and, while professional programmers adapt themselves to the programming language that best fits their needs, in personal environments it is the language that has to adapt itself to the user's programming skills and needs.

Minsky describes instinctive reactions as a set of "if-then" rules [14] and Myers points to rule-based languages as those naturally used by people in solving problems [15]. Thus, while some systems have applied methodologies such as script piping [16][17] or graph creation [18] for end-user programming in smart environments, it is not surprising that many systems decided to exploit the naturalness of a rule-based approach to programming [19][20][21][22][23][24][25][26][27]. Nevertheless, systems that are designed for simplicity restrict the domain of automation and severely limit the complexity of programs or the cohabitation of simultaneous users or activities [26][19][23]. Others, allowing complex programs to be built, impose that complexity on the overall system [22][21][25][27]. In summary, accessibility has been considered a domain choice rather than an adaptive parameter.

Exploitational Interaction, on the other hand, requires the programming language and structure to adapt to the user's needs and skills, to grow with them and to ease the discovery of further possibilities. It should provide a minimum level of complexity for users without programming skills, one that can progressively grow with and promote the user's abilities, and it should provide high control power to fit advanced programmers' needs. Therefore, Exploitational Interaction requires a new way of looking at programming systems as adaptive tools, one in which the complexity of the language is reduced to the minimum, discased of any dispensable concept and designed to build complex concepts as simple and affordable extensions of simpler ones.

Focusing on Papert's ideal of "low threshold, no ceiling" [28], we propose a "matryoshka doll" approach to rule-based languages for end-user programming. In this approach, complexity is built in terms of simpler concepts and isolated into independent structures, so that the required initial knowledge for programming is minimal, simple use leads to complex learning, and complex structures can match professional expectations. A multiagent modular structure is then provided to allow the cohabitation and coordination of multiple domains, people and preferences in the environment. The proposed multiagent platform, introduced in Section 5, is driven by the analysis of complex systems and social organizations, thus allowing a user-centered approach from its foundations. Coordination of different agents is
done through the same rule–based language, unifying preferences and hierarchies into a single paradigm of progressive global control.
3 Minimal Kernel: A Discased Language
The problem of expression in programming languages, especially in end-user-oriented ones, is a user-dependent problem. Thus, while expert programmers find limitations in what the language can express, novice users find them in what their knowledge allows them to express. Providing a minimal kernel that maximizes the programming capabilities of non-programmers requires:

– Focusing on the most natural structures used for programming, to minimize the knowledge needed to program.
– Addressing the most common domains of programming, to maximize the outcome.

While Myers pointed to rule-based systems as those closest to the end-user's way of thinking and programming [15], Dey et al. [29] observed in an end-user programming study that 56.4% of all the rules involved objects or the state of objects, 19.1% activities, 12.8% locations and only 7.6% time. These two issues have been analyzed separately in the literature but have not been considered together when designing end-user programming systems for intelligent environments. For example, Augusto and Nugent's Temporal Reasoning Language (TRL) [21] adds the complexity of dealing with time concepts to every other, time-free rule in order to deal more than comfortably with the 7.6% of tasks involving time, yet this makes the language harder to use in an end-user-centered control system. The SmartOffice system [27], as another example, forces programmers to split and distribute their programs into the different modules they use, imposing prior knowledge of the technical architecture of the system in order to program it.

In order to reduce the complexity of the overall system, we used an event-free abstraction, modeling the environment as a set of entities with properties and relations [30][31]. This abstraction layer provides a middleware that removes the technical details of the environment, homogenizing technologies and grouping elements into their natural sets through a hierarchy of types such as persons, devices or locations. While events have been modeled separately from states in many control systems, we chose a simpler, photographic model in which only states are modeled. Events are thus not explicitly represented but, as in the real world, can be inferred from a status change. A Java API and a subscription mechanism allow programs to be notified of changes in the variables of interest, thus solving the problem from the application developer's point of view. From a rule-based perspective we decided to consider events and states as functional categories rather than as different elements of the world. Thus, the state of the TV can function either as a Trigger or as a Condition, but a single static concept for the TV status will exist regardless of whether it has just been turned on or has been on for three hours. This abstraction is captured by the Blackboard [30][31], a simple entity-property-relation model.
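To make this photographic, event-free model concrete, the following minimal Java sketch shows an entity-property blackboard in which setting a property is the only primitive and "events" exist solely as change notifications delivered to subscribers. The class and method names (Blackboard, set, subscribe) are illustrative assumptions made for this sketch, not the actual AmILab API.

// A minimal sketch of an entity-property blackboard with change subscriptions.
import java.util.*;
import java.util.function.BiConsumer;

public class Blackboard {
    // "type:name:property" -> value, e.g. "device:alarm_clock:status" -> "ON"
    private final Map<String, String> state = new HashMap<>();
    // property key -> listeners notified when that property changes
    private final Map<String, List<BiConsumer<String, String>>> subscribers = new HashMap<>();

    public String get(String key) { return state.get(key); }

    // Setting a property is the only primitive: "events" are not stored,
    // they are simply the notifications produced by a state change.
    public void set(String key, String newValue) {
        String old = state.put(key, newValue);
        if (!Objects.equals(old, newValue)) {
            for (BiConsumer<String, String> listener : subscribers.getOrDefault(key, List.of())) {
                listener.accept(key, newValue);
            }
        }
    }

    public void subscribe(String key, BiConsumer<String, String> listener) {
        subscribers.computeIfAbsent(key, k -> new ArrayList<>()).add(listener);
    }

    public static void main(String[] args) {
        Blackboard bb = new Blackboard();
        bb.set("furniture:bed:status", "EMPTY");
        // Equivalent of the rule "when the alarm clock changes, if it is ON and the bed
        // is empty, then turn it off", expressed against the subscription API.
        bb.subscribe("device:alarm_clock:status", (key, value) -> {
            if ("ON".equals(value) && "EMPTY".equals(bb.get("furniture:bed:status"))) {
                bb.set("device:alarm_clock:status", "OFF");
            }
        });
        bb.set("device:alarm_clock:status", "ON");
        System.out.println(bb.get("device:alarm_clock:status")); // prints OFF
    }
}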
The natural language structure of rules (i.e., when... if... then) provides three different conceptual categories that already shape the element they contain, automatically assigning it a different function in the sentence and thus allowing events to be left out of the world model. Using this structure, we define the basic structure of the language as simple reaction rules since, as stated by Myers [15], end-users prefer them to transformation rules when programming. Each rule is divided into three parts: triggers, conditions and actions.

– Triggers: context variables whose change is responsible for activating the rule.
– Conditions: a set of "context variable-value" pairs representing a context state that needs to be satisfied for the action to fire.
– Action: a "context variable-value" pair to be set when, in a triggered rule, all its conditions evaluate to true.

These parts are structured according to the following template:

WHEN trigger1 OR trigger2 OR ... CHANGES
IF condition1 AND condition2 AND ... HOLD
THEN DO action1 and action2 and ... ;

Conditions and actions are constructed as three-part structures of the type < LHS > < operator > < RHS >, used to compare the values of properties or relations (e.g., = is equal to, > is bigger than, != is different from) and to affect them (e.g., := assigns, += increments or -= decrements the value of a property, and => creates or =< deletes a relation, property or entity), respectively for conditions and actions. See Table 3 for the detailed grammar.

The minimal kernel is designed to allow simple structures to be built easily, minimizing the initial knowledge required to program the environment. A simple example, turning off the alarm clock if it starts when the bed is empty, is expressed following the natural language structure "When the alarm clock starts, if the bed is empty then turn it off", composed of one trigger (the value of the alarm clock), two conditions (the alarm clock being on and the bed being empty) and the action (to turn it off), in the following structure:

WHEN device:alarm_clock:status CHANGES
IF device:alarm_clock:status = ON AND
   furniture:bed:status = EMPTY HOLDS
THEN DO device:alarm_clock:status := OFF ;

Expertise and programming skill result, even for the most basic preferences, in more or less efficient structures. Nevertheless, control is achieved. Implementing a toggle switch can be expressed, for example, using two rules:
WHEN switch:interruptor1:value CHANGES
IF light:lamp_1:status = OFF HOLDS
THEN DO light:lamp_1:status := ON ;

WHEN switch:interruptor1:value CHANGES
IF light:lamp_1:status = ON HOLDS
THEN DO light:lamp_1:status := OFF ;

Or, given that the light's status property is binary, in a single rule using the ASSIGN NOT operator (=!):

WHEN switch:interruptor1:value CHANGES
THEN DO light:lamp_1:status =! light:lamp_1:status ;

The set of behaviors that can be expressed with the basic language is quite extensive, given that conditions may refer not only to property values but also to relationships between entities, such as "When Pablo enters the laboratory, greet him through the avatar Maxine":

WHEN person:Pablo:location CHANGES
IF person:Pablo:location = room:lab_B403 HOLDS
THEN DO avatar:maxine:say := "Hello Pablo" ;

Properties and relations are indistinguishable from the user's point of view, since relations are treated as properties whose value is another entity, such as "located at", "belongs to" or "likes".

To test the ease of use of both the Trigger/Condition separation and the Entity-Property-Relation model, a user study was conducted with over 30 Spanish-speaking subjects between 17 and 66 years old with various professional and educational backgrounds. Categorized as with (P) and without (NP) previous programming experience, they were given 5 minutes of training on the elements and programming language of a virtual environment. Asked to program the behaviors shown in a series of videos by filling in a When-If-Then template, their results were analyzed to measure:

I1) the correct differentiation of elements acting as triggers from those acting as conditions,
I2) the correct assignment of the trigger and condition sets to their corresponding boxes in the template,
I3) the time used to program each behavior, and
I4) the perceived difficulty of the scenario.

The results can be found in Table 1 and Table 2. Differentiating triggers from conditions proved equally easy for both groups, with 87.50% of NP and 90.77% of P doing so correctly and no statistically significant difference between their performance (p-value of 0.68). At the same time, while
Table 1. I1 (Differentiation of Triggers/Conditions) and I2 (Identification of Triggers/Conditions) results (in %, for Good, Medium, Bad) for end-users with (P) and without (NP) programming knowledge, and the corresponding Mann-Whitney U test p-values

                              Differentiation                Identification
                              Good     Medium   Bad          Good     Medium   Bad
Non-Programmers               87.50%   6.25%    6.25%        57.50%   3.75%    38.75%
Programmers                   90.77%   6.15%    3.08%        80.00%   9.23%    10.77%
p (statistical significance)           0.68                           0.0029
Table 2. I3 (Time) and I4 (Difficulty) results (in %, by minutes spent on the task and by perceived difficulty: Easy, Medium, Hard, Impossible) for end-users with (P) and without (NP) programming knowledge

                      Time (minutes)                             Difficulty
                      1       2       3       4       5          Easy    Medium   Hard    Impossible
Non-Programmers       18.8%   38.8%   16.3%   6.3%    18.8%      51.3%   40.0%    7.5%    0.0%
Programmers           29.3%   33.9%   26.2%   4.6%    4.6%       60.0%   29.2%    9.2%    1.5%
the correct identification of the category of the sets proved more complicated for NP (57.50% of perfect identification) than for P (80.00%), showing a statistically significant difference (p-value of 0.0029).

These results show that, while the differentiation between triggers and conditions could be made naturally both by programmers and non-programmers, the Spanish words for "when" and "if" are semantically close (as they are in English) and may be misleading to users unfamiliar with the inflexibility of computing languages; e.g., sentences such as "when my favorite TV show begins, if I am not at home..." can also be expressed as "If I am not at home when my favorite TV show begins...". Thus, while triggers and conditions are easily differentiated, natural language does not make their identification easy for non-programmers.

The time spent programming each rule is similar in both cases: 2.18 minutes on average for P and 2.64 for NP. In addition, while most participants found the exercises easy or of medium difficulty, only a very small set of P found some of them extremely difficult which, when compared with their answers, corresponds to complicated solutions involving some sort of recursion or loops. NP users, unfamiliar with such concepts, found simpler solutions that, while not as polished as those of the programmers, were programmed in much simpler ways. In summary, a simple language allows users to provide reasonable solutions to a wide variety of problems and, while this simplicity allows inexperienced users to program without realizing the complexity of the problem at hand, experienced programmers
Table 3. Rule-based language's grammar

rule              ::=  <triggerlist> '::' <conditionlist>? '⇒' <actionlist> ';'
triggerlist       ::=  <trigger> ('||' <trigger>)*
trigger           ::=  <element>
conditionlist     ::=  <condition> ('&&' <condition>)*
condition         ::=  <element> <comparator> <element>
actionlist        ::=  <action> ('&&' <action>)*
action            ::=  <element> <operation> <element> <confidence>?  |  <timer>
confidence        ::=  '|' INT
timer             ::=  'TIMER' <timer time> <concurrence>? '{' <timer end agent>? '}' '{' <rulelist>? '}'
timer end agent   ::=  <timer end stat>+
timer end stat    ::=  <timer end rule> | LINE COMMENT | COMMENT
timer end rule    ::=  (<conditionlist> '⇒')? <actionlist> ';'
timer time        ::=  NAME
concurrence       ::=  INT
rulelist          ::=  <rule>+
comparator        ::=  EQUAL | NOT EQUAL | GREATER | SMALLER | GREATER OR EQUAL | SMALLER OR EQUAL
operation         ::=  ADD PROPERTY | ADD RELATION | ASSIGN | ASSIGN NOT | CREATE ENTITY | MINUS | PLUS | REMOVE RELATION
element           ::=  <literal> | BB ENTITY | BB ELEMENT
literal           ::=  NAME | INT
find some difficulties in optimizing complex solutions due to the limitations of such a simple tool.

Identifying errors in user rules is as important as preserving the predictability of the home. Inexperienced programmers tend to program their preferences as they come, without an overall design or a proper test bench, so it is not surprising that their creations are not always as appropriate as they thought they would be. This problem can be addressed through two different tools: an explanation mechanism, allowing users to understand the inner workings of the environment's behaviors and their possible faults, and an automatic learning process, either to adapt the end-users' programs to what they really wanted to program or to spot in advance the possible failure points of a program as a refining process.

ECA-rules, as an explicit expression of a desire for automatic behavior, provide an open door for learning without breaking the predictability of the home. For example, once the lights have been programmed to adjust to people getting in and out of the room, users will not be surprised if the lights go on and off as people get in and out of the room. Therefore, automatic learning using the user's programs as a guide will only improve the competence of an expected behavior. To this end, ECA-rule actions are assigned a confidence factor (CF), increased and decreased through a reinforcement process driven by the reactions to the effects of the actions. Based on a time lapse after the execution of an action in which actions over the same element are considered to be corrective reactions, this mechanism allows the appropriateness of a rule to be measured and ambiguous rules to be identified as those whose CF is consecutively increased and decreased. Besides limiting the search space for disambiguation, by offering a target rule and the context sets in which its action was correct and corrected, this mechanism also allows contradicting loops to be prevented: CFs rapidly decrease in such cases, allowing the risk to be identified and the rules' execution to be stopped while the proper measures are taken.
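The following minimal Java sketch illustrates the reinforcement idea under stated assumptions: the fixed correction window, the reward and penalty steps and the oscillation count used as the ambiguity test are all illustrative choices, not the parameters or code of the actual system.

// Sketch of confidence-factor (CF) reinforcement: actions on the same element shortly
// after a rule fires count as corrective reactions and lower the CF; otherwise the CF
// is reinforced. Oscillating updates mark a rule as ambiguous.
import java.util.ArrayDeque;
import java.util.Deque;

class RuleConfidence {
    private static final long CORRECTION_WINDOW_MS = 60_000;   // assumed window length
    double cf = 0.5;                                            // current confidence factor
    private long lastFired = -1;
    private String lastElement = null;
    private final Deque<Integer> recentUpdates = new ArrayDeque<>(); // +1 reinforce, -1 correct

    void onRuleFired(String element, long now) { lastFired = now; lastElement = element; }

    // Any external action on the same element inside the window counts as a correction.
    void onObservedAction(String element, long now) {
        if (lastFired >= 0 && element.equals(lastElement)
                && now - lastFired <= CORRECTION_WINDOW_MS) {
            update(-0.1, -1);
        }
    }

    // The window closed with no correction: the action was accepted.
    void onWindowElapsedWithoutCorrection() { update(+0.05, +1); }

    private void update(double delta, int sign) {
        cf = Math.min(1.0, Math.max(0.0, cf + delta));
        recentUpdates.addLast(sign);
        if (recentUpdates.size() > 6) recentUpdates.removeFirst();
    }

    // Ambiguous rules are those whose CF is consecutively increased and decreased.
    boolean looksAmbiguous() {
        int flips = 0; Integer prev = null;
        for (int s : recentUpdates) { if (prev != null && s != prev) flips++; prev = s; }
        return flips >= 3;                                      // illustrative threshold
    }

    // A rapidly collapsing CF signals a contradicting loop; callers should stop execution.
    boolean looksLikeContradictingLoop() { return cf <= 0.1; }
}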
4 Progressive Complexity
While the base language makes basic control structures programmable with minimal knowledge, accessibility becomes a more difficult challenge as control requirements and needs grow in complexity. We propose a modular approach to dealing with complex concepts in which a few basic extensions can be added to the language and thereon progressively extended. By correctly choosing the extensions and reusing previous concepts, it is therefore possible to progressively reach levels of complexity similar to those required in professional environments such as Active Database Management Systems, dealing only with the complexity of the concepts strictly needed.
4.1 Generality and Anaphora: Sets
The first complex extension we propose is generality, which allows referring to elements not by their name but by their general properties. To deal with
generality, we introduced the wildcard into the language, which, in its most basic form, substitutes the name of an entity with an asterisk * to refer to any entity of the same type. For example, "if a light is on" can be encoded as the condition "light:*:status = ON". Wildcards are further extended to support anaphora by using the $ sign to refer to the entity signaled by the wildcard. To allow multiple anaphora, the $ sign is followed by a number indicating which of the wildcards it refers to (i.e., 0 referring to the first * appearing in the rule, 1 to the next one and so on). For example, "if a light is on turn it off" can be coded as the condition "light:*:status = ON" and the action "light:$0:status := OFF".

In dealing with generality, wildcards open a natural door to working with sets: when conditions are evaluated on a wildcard, the set of elements the wildcard refers to is restricted to those meeting the condition. Using anaphora allows sets to be repeatedly filtered. For example, the set of lights turned on ("light:*:status = ON") can be filtered to those turned on that are in M. Herranz's location ("light:*:status = ON AND light:$0:location = person:mherranz:location"). If the set is empty after filtering, then there is no element in the environment able to meet all the conditions and the rule is not executed. On the other hand, an action over a wildcard is executed on all the elements of the set, allowing an action on many elements at once. Thus, an automatic behavior showing any message sent to M. Herranz on every available display at M. Herranz's location can be encoded in the following rule:

WHEN message:* CHANGES
     (a message)
IF message:$0:to = person:mherranz
     (the message is for person:mherranz)
   AND display:*:location = person:mherranz:location
     (there is at least one display in person:mherranz:location)
   AND display:$1:status = available
     (at least one display is available)
THEN DO message:$0:to -> display:$1
     (show the message on the available displays)
;

Thus, the wildcard allows progressive work with generality, anaphora and sets, not by introducing new concepts but by providing one concept with multiple dimensions.
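A minimal Java sketch of this set-based reading of wildcards follows. The world representation, class names and hard-coded example data are illustrative assumptions; the point is only to show how each condition progressively filters the candidate set bound to a wildcard before the action is applied to every surviving element.

// Wildcard conditions as progressive set filtering over a toy world model.
import java.util.*;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class WildcardSets {
    // entity id (e.g. "light:lamp_1") -> property name -> value
    static Map<String, Map<String, String>> world = new HashMap<>();

    static Set<String> entitiesOfType(String type) {
        return world.keySet().stream()
                .filter(id -> id.startsWith(type + ":"))
                .collect(Collectors.toSet());
    }

    // Each condition narrows the candidate set bound to the wildcard ($0).
    static Set<String> filter(Set<String> candidates, Predicate<String> condition) {
        return candidates.stream().filter(condition).collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        world.put("light:lamp_1", new HashMap<>(Map.of("status", "ON",  "location", "lab")));
        world.put("light:lamp_2", new HashMap<>(Map.of("status", "ON",  "location", "hall")));
        world.put("light:lamp_3", new HashMap<>(Map.of("status", "OFF", "location", "lab")));
        String userLocation = "lab";

        // "light:*:status = ON AND light:$0:location = person:mherranz:location"
        Set<String> set = entitiesOfType("light");
        set = filter(set, id -> "ON".equals(world.get(id).get("status")));
        set = filter(set, id -> userLocation.equals(world.get(id).get("location")));

        // If the set is empty the rule does not fire; otherwise the action runs on all members.
        for (String id : set) world.get(id).put("status", "OFF");
        System.out.println(world);   // only lamp_1 was turned off
    }
}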
4.2 Time and Event Composition: Recursive Composition
Probably the most significant example of multi-dimensional complex concepts is the TIMER. Introduced to deal with simple time concepts, TIMERS are designed
to provide a soft learning curve by reusing basic concepts, and to allow extremely complex structures to be built. TIMERS are introduced as a delay to an action. They are a new type of action, leaving the rest of the language untouched, in which it is possible to specify the amount of time to wait before executing the action, as in "Two minutes after I close the door turn off the light". The TIMER base structure is of the type TIMER ending time {THEN DO actions}, in which the actions are executed after the ending time has elapsed; therefore the previous example can be codified as:

WHEN door:mydoor:status CHANGES
IF door:mydoor:status = CLOSE HOLDS
THEN DO TIMER 2m {THEN DO light:mylight:status := OFF}
     (wait 2 minutes then turn off my light)
;
TIMERS are designed to be progressively extended. First, they allow actions to be replaced with rules that can use conditions when the time has elapsed, to express behaviors such as "Two minutes after I close the door, if the room is empty, turn off the light". We refer to these rules as on finished rules:

WHEN door:mydoor:status CHANGES
IF door:mydoor:status = CLOSE HOLDS
THEN DO TIMER 2m {THEN IF room:myroom:habitants = 0 HOLDS
                  THEN DO light:mylight:status := OFF;}
     (wait 2 minutes then, if the room is empty, turn off my light)
;

The second extension is to add an extra set of rules to be evaluated only when the timer is running. Referred to as on running rules, they can be added after the set of on finished rules (TIMER ending time {THEN DO on finished rules} {DURING WHICH DO on running rules}) to express a "while" interval in which the user wants something to be done under certain conditions, such as "If, during the two minutes after I close the door, my room becomes empty, turn off the light":

WHEN door:mydoor:status CHANGES
IF door:mydoor:status = CLOSE HOLDS
THEN DO TIMER 2m {THEN DO nothing}
     {DURING WHICH DO
        WHEN room:myroom:habitants CHANGES
        IF room:myroom:habitants = 0 HOLDS
        THEN DO light:mylight:status := OFF;
     }
;

Finally, a set of on load rules can be added after the on running rules, as well as a concurrence factor (the number of times a TIMER can be running simultaneously), in the form TIMER ending time concurrence {THEN on finished rules} {DURING WHICH on running rules} {BEFORE on load rules}.
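To make the shape of this construct concrete, the following minimal Java sketch models a timer action with an on finished block, a periodically evaluated during which block and a concurrence limit. It leans on java.util.Timer for brevity, and its names, polling interval and structure are illustrative assumptions rather than the chapter's actual implementation, in which the on running rules would be evaluated by the rule engine itself.

// Sketch of a TIMER-style action: delayed "on finished" work, a "during which" check
// re-evaluated while the timer runs, and a limit on concurrent instances.
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicInteger;

class RuleTimer {
    private final long delayMs;
    private final int concurrence;                  // how many instances may run at once
    private final AtomicInteger running = new AtomicInteger(0);
    private final Runnable onFinished;              // "THEN DO ..." after the delay
    private final Runnable onRunningCheck;          // "DURING WHICH DO ..." rules, polled

    RuleTimer(long delayMs, int concurrence, Runnable onFinished, Runnable onRunningCheck) {
        this.delayMs = delayMs;
        this.concurrence = concurrence;
        this.onFinished = onFinished;
        this.onRunningCheck = onRunningCheck;
    }

    // Called when the enclosing rule fires, e.g. "2 minutes after I close the door ...".
    void start() {
        if (running.incrementAndGet() > concurrence) { running.decrementAndGet(); return; }
        Timer t = new Timer(true);
        // While the timer runs, the "on running" rules are re-evaluated periodically.
        TimerTask poll = new TimerTask() { public void run() { onRunningCheck.run(); } };
        t.scheduleAtFixedRate(poll, 0, 1000);
        t.schedule(new TimerTask() {
            public void run() {
                poll.cancel();
                onFinished.run();                   // "on finished" rules fire once
                running.decrementAndGet();
                t.cancel();
            }
        }, delayMs);
    }
}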
Fig. 1. Incorporating conditions into event composition allows differentiating similar events based on the context in which they occur
Fig. 2. Illustration of consumption policies for composite event detection. Various consumption policies are compared with a "mixed" policy in which the composite event is designed to use the first instances of the initiator and terminator events but the last instance of the in-between events. Illustration inspired by [35].
This TIMER form, combined with the ability to modify and use the status of the TIMER (TIMER.pause, TIMER.start, TIMER.reset, ...) in the on running rules as if it were just another element of the environment, allows the building of composite events (sequences, disjunctions, conjunctions, closure, periodic events and so on) and consumption policies (recent, continuous or cumulative) as complex as those used in expert Database Management Systems [32], Interval-Based Event Calculus [33], Event Composition Operators [34] or Temporal Reasoning [21].

Traditional event composition [34][32] defines composite events in terms of simpler events and a global consumption policy. For example, a composite event can be defined as the sequence (E1 ; E2), that is, (E2) happening after (E1). Establishing a "recent" consumption policy determines that the most recent occurrences of (E1) and (E2) will be considered to compose (E1 ; E2); thus, in the sequence E1, E1, E2, only the most recent E1 and the E2 will be considered. TIMERS provide a procedural mechanism to go beyond traditional event composition in which:

– Conditions can be considered in the composition, allowing differentiation between similar events by specifying a particular context in which they must occur to be considered part of the composite event. Thus, a sequence (E1 ; E1) can be particularized so that the first event occurs when a certain condition c holds, while for the second one ∼c must hold (see Figure 1).
– Consumption policies can be applied locally, to each raw event, allowing different consumption policies to be mixed when defining a composite event (see Figure 2).
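As a simple point of comparison, the following Java sketch detects the sequence (E1 ; E2) under a recent consumption policy, with an optional condition attached to the initiator event. It is a generic illustration of the composition idea, not the TIMER-based mechanism itself, and every name in it is an assumption.

// Detecting the composite event (E1 ; E2) with a "recent" policy on the initiator.
import java.util.function.Predicate;

class SequenceDetector {
    private final String first, second;            // event types, e.g. "E1", "E2"
    private final Predicate<Long> firstCondition;  // context that must hold when E1 occurs
    private Long lastFirst = null;                 // most recent qualifying E1 (recent policy)

    SequenceDetector(String first, String second, Predicate<Long> firstCondition) {
        this.first = first;
        this.second = second;
        this.firstCondition = firstCondition;
    }

    // Feed every raw event; returns true when the composite event (first ; second) fires.
    boolean onEvent(String type, long timestamp) {
        if (type.equals(first) && firstCondition.test(timestamp)) {
            lastFirst = timestamp;                 // "recent": keep only the latest initiator
            return false;
        }
        if (type.equals(second) && lastFirst != null) {
            lastFirst = null;                      // consume the initiator
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        SequenceDetector d = new SequenceDetector("E1", "E2", t -> true);
        System.out.println(d.onEvent("E1", 1));    // false
        System.out.println(d.onEvent("E1", 2));    // false (replaces the earlier E1)
        System.out.println(d.onEvent("E2", 3));    // true: composite (E1 ; E2) detected
    }
}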
Fig. 3. Personal Ambient Intelligent Reminders: interface for creating Ambient Intelligence reminders. It allows the creation of time-based rules, ordering activities and scheduling responses, hiding the complexity of timers from caregivers.
While a natural base language allowed the creation of a simple drag-and-drop GUI and a free-choice, Fridge Magnet Poetry-style GUI to create simple behaviors in the environment, the power of complex event composition was used to create an interface to program reminders for people with Alzheimer's disease, in which behaviors depend on whether an activity has taken place before others, within a time lapse, or on a combination of events and conditions (see Figure 3). In any case, the resulting rule can be modified and understood through any of these interfaces, preserving a structure similar to the original mental plans of the user. Autonomous learning is restricted to the automation domains implicitly opened by user-defined rules. Thus, the competence of user-defined rules is strengthened while the overall behavior of the environment is kept within the user's expectations.
5 Organic and Modular Programming Structure
As described in the previous sections, a new way of designing programming languages as progressive and modular structures allows control to adapt progressively to users' needs and skills, balancing control and accessibility in a gradual, user-sensitive manner. Nevertheless, preferences are complex and interconnected structures, composed of several behaviors dependent on each other, arising and changing as individuals, conditions or domains of automation evolve. In this sense, structuring and coordinating behaviors are complex processes traditionally requiring profound analysis and thoughtful design. People nevertheless group preferences and needs intuitively, although they do so following different patterns, associating responsibilities according to those groupings. In addition, multiple users with different preferences may share a single environment. Thus, personal environments are dynamic spaces, strongly influenced by social factors, managed at a human level by establishing hierarchies, spreading responsibilities or defining ownership by means that existed long before any kind of computing technology populated those places. This status quo conflicts with having an automatic system or a third human party specify the hierarchies, for we aim to allow each social group to express its own. This apparently naive principle poses a profound problem, since any fixed structure will interfere with the different natural structures of users. Therefore, the structure must be flexible enough to adapt to the natural structures and organizations of social groups.

Some systems have handled the coordination problem through classifying, organizing and structuring. Kakas et al. [36] provide each agent of a multi-agent system with two types of priority rules: role priorities and context priorities. Role rules prioritize according to the function of the parts involved, while context rules prioritize role rules according to some extra context (some sort of task hierarchy). Additionally, each agent is supplied with a motivation structure, that is, a hierarchy over the goals it pursues, defined according to Maslow's needs categorization [37]. Finally, an additional hierarchy is specified to define the agent's "personality" (i.e., its decision policy on needs to accomplish the goals of its motivation).
While this kind of structuring captures many of the flavors of human behavior, it presents some problems when applied to personal environments. First, it needs a professional (expert in Maslow's theory) to program the system, as well as a clear a priori classification of the users' goals, roles and tasks. These engineering solutions, while perfectly valid for domains such as business automation, are not suited to home environments, on which they impose an overly rigid categorization, the need for a deep a priori definition or a third-party programmer. Other techniques, such as AHP (the Analytic Hierarchy Process) or ANP (the Analytic Network Process), define structured processes to prioritize preferences based on heterogeneous criteria and the user's judgment. However, they require a deep and time-consuming analysis of the overall system. Since coordination is, nevertheless, a crucial part of any programming environment, balancing control and accessibility requires not only redesigning the programming language but also applying the same design principles to the structuring and coordinating mechanisms.
5.1 Structuring
As with the programming language, we chose a simple initial structure that can be further expanded to gain complexity. This basic structure is the agent. ECA rules (i.e., behaviors) can be distributed among different agents. As each of them is an independent reasoning engine, agents not only help in structuring and modularizing an increasing set of rules, but they also distribute the inference and computing cost among different processes and machines. From the user's point of view, agents are seen as the virtual equivalent of butlers, in that users can "hire" and command as many of them as desired. Agents can be activated and deactivated independently, allowing sets of rules to be managed as single entities.

To reinforce the understanding and awareness of agents as butlers or assistants, they are represented as another part of the environment, modeled as entities (of the type agent) with a status property and an owner relation linking each agent to the person or group of persons for which it works. The status defines whether the agent is active or inactive, allowing the user to check in real time which sets of behaviors are currently running in the environment. The owner relationships allow responsibilities to be tracked back to the human level. In order to enrich the structure, an optional task property can be defined to specify the abstract goal of the agent (e.g., lighting, security, gardening), allowing users to translate their own categorizations to the agent system. Similarly, an optional location relation links the agent to the physical bounds within which it acts. Finally, while the inner rules of the agent are not modeled in the Blackboard middleware, a set of affects relations links the agent with all the elements that may be modified by it, those appearing in the action parts of its ECA-rules (see Figure 4).

This mechanism allows the modularization and categorization of behaviors according to the end-users' mental plan, and it also allows scrutiny of the system, tracking the affects relationships from the elements to the agents in order to find responsibilities and spot possible conflicts, or querying the environment in many
Fig. 4. Representation of an agent, properties and relations in the Blackboard
different ways to check how many agents of some user are running in a specific location, regarding a particular task or affecting a particular device.
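A minimal Java sketch of this agent-as-entity idea follows, with a status property, owner and affects relations and the kind of query just described. The classes, fields and the countActive signature are illustrative assumptions, not the Blackboard's actual API.

// Agents modeled as entities with a status property and owner/affects relations,
// plus the kind of query described above.
import java.util.*;

class AgentEntity {
    final String id;                                          // e.g. "agent:PabloTV"
    final Map<String, String> properties = new HashMap<>();   // status, task, location, ...
    final Set<String> owners = new HashSet<>();               // "owner" relation
    final Set<String> affects = new HashSet<>();              // "affects" relation

    AgentEntity(String id, String owner, String task, String location) {
        this.id = id;
        owners.add(owner);
        properties.put("status", "ACTIVE");
        properties.put("task", task);
        properties.put("location", location);
    }
    boolean isActive() { return "ACTIVE".equals(properties.get("status")); }
}

public class AgentQuerySketch {
    // How many of a user's agents are active in a location, for a task, or affecting a
    // device? A null argument means "don't care".
    static long countActive(List<AgentEntity> agents, String owner, String location,
                            String task, String affected) {
        return agents.stream()
                .filter(AgentEntity::isActive)
                .filter(a -> owner == null || a.owners.contains(owner))
                .filter(a -> location == null || location.equals(a.properties.get("location")))
                .filter(a -> task == null || task.equals(a.properties.get("task")))
                .filter(a -> affected == null || a.affects.contains(affected))
                .count();
    }

    public static void main(String[] args) {
        AgentEntity tv = new AgentEntity("agent:PabloTV", "person:Pablo", "entertainment", "room:living");
        tv.affects.add("device:TV");
        AgentEntity lights = new AgentEntity("agent:PabloLights", "person:Pablo", "lighting", "room:living");
        lights.affects.add("light:lamp_1");
        List<AgentEntity> agents = List.of(tv, lights);

        System.out.println(countActive(agents, "person:Pablo", "room:living", null, null)); // 2
        System.out.println(countActive(agents, null, null, null, "device:TV"));             // 1
    }
}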
5.2 Coordinating
Coordinating preferences is closely related to the problem of creating hierarchies. Multiple users inhabiting the same space make the interaction dependent on the other users and their preferences. Hierarchies are the natural social structures for establishing an order of preference but, linked as they are to the social group that created them, they are multiple and dynamic, and their complexity reflects the complexity of the social group they rule. To allow both control and accessibility, we want to allow the creation of hierarchies as complex as the social structures they govern, but without imposing, especially on the simpler scenarios, a profound analysis or a priori knowledge in order to build them.

By modeling the agents as another part of the environment, we allow end-users to create behaviors that do not affect a physical element of the environment but an agent instead. Thus, a simple hierarchy, a preference over preferences, can be expressed using the same mechanism described in the previous sections, with rules such as "When I leave a room, deactivate all my agents located in that room":

WHEN person:Manuel:location CHANGES
IF agent:*:owner = person:Manuel AND
   agent:$0:status = ACTIVE AND
   agent:$0:location != person:Manuel:location HOLDS
     (if there are any of Manuel's agents active and not in person:Manuel:location)
THEN DO agent:$0:status := INACTIVE
     (deactivate those agents)
;

Or "When Pablo's TV agent activates, deactivate all other agents affecting the TV":

WHEN agent:PabloTV:status CHANGES
IF agent:PabloTV:status = ACTIVE AND
   agent:*:affects = device:TV AND
   agent:$0:status = ACTIVE HOLDS
     (if there are any active agents affecting the TV)
THEN DO agent:$0:status := INACTIVE
     (deactivate them)
;

Based on the assumption that conflicts and hierarchies are strictly bound to the social group at hand, this mechanism does not prevent conflicts, but it allows easy programming of the solutions users find when conflicts arise. In order to prevent conflicts, a priority queue mechanism is established in the Blackboard to define default policies over the elements of the environment [38].

Once the basic brick for programming hierarchies is defined, how do we measure the different degrees of complexity the system allows? Bar-Yam's [39] concepts of interdependence and scale for measuring complexity are quite useful in this undertaking, since they provide simple measures to categorize the overall complexity. Interdependence describes the effects a part has over the rest of the system, how the parts are interconnected and how they depend on each other. Scale, on the other hand, refers to the different degrees of complexity a system shows depending on how close or far removed the observer is. That is, Bar-Yam considers complexity a subjective measure that depends not only on the system, but also on the distance between the observer and the system. The important point of scale is not only that it changes the perceived complexity of the system, but also that this change characterizes the system: "the variation of complexity as scale varies can reveal important properties of a system". Using these two variables, we can identify them in the programming system, characterizing the underlying network of preferences and the nature of the social structure that programs it.
Interdependence is easy to see in the Blackboard model as the graph created by all the affects relations. Depending on the scenario, this graph ranges from pyramidal structures to unconnected graphs or entangled networks, with a person or persons behind each agent, an element of the environment in every leaf of the graph and, in between, a complex structure of conditions that, as a whole, governs the overall automatic behavior of the environment (see Figure 5). Scale, on the other hand, can be appreciated in the different levels at which hierarchies can be expressed.
Fig. 5. An example of interdependence in the graph created by the connections between people and their agents (is owner), and the agents with the objects they affect (affects)
To illustrate this with an example, let us consider two users sharing a house. User A prefers the light level to be low while user B prefers it high. In this situation, they can control their preferences through a single agent (associated with both of them) in which three rules codify their preferences: "if user A is in the house but not B, when watching TV, set the light level to low", "if user B is in the house but not A, when watching TV, set the light level to high" and "if both A and B are in the house, when watching TV, set the light level to medium". Alternatively, they can have an agent each, codifying their personal preferences, another shared agent codifying their mutual preferences (i.e., what they want when they are together) and a meta-agent deactivating their personal
agents and activating the shared one when both of them are in the house, and vice versa. Or, finally, they can do without agents altogether and establish a default policy in the Blackboard that sets the average as the default value for the light when a conflict arises. While codifying the same behavior, the three approaches present a different interdependence and scale, and each will be preferred according to the idiosyncrasy of the social group. Thus, the first will be more frequent in situations in which most of the preferences are shared (e.g., a couple sharing a house), the second when each individual normally decides alone and some coordinating mechanism is required (e.g., student roommates), while the third is more natural in sporadic environments in which personal preferences are secondary (e.g., a laboratory hallway). These structures have been observed in the three environments in which the system has been deployed: a simulated living room in the AmILab laboratory (Autonomous University of Madrid, Spain), a simulated security chamber at Indra's facilities (Madrid) and an intelligent classroom in the Itechcalli laboratory (Zacatecas, Mexico), as well as in other programming experiences [40].

In conclusion, considering the programming language and the multi-agent structure as a whole, a progressive design balancing control and accessibility allows users to exploit the possibilities of the environment progressively. With an extremely low threshold and a smooth learning curve, the environment is progressively populated with behaviors. Simple scenarios begin as an unorganized set of rules that are later refined and structured as the scenario gains in complexity and the users in skills. This can be seen in Figure 6, a graph showing the evolution of persons, rules and agents in the AmILab living room from 2006 to 2009. A simple set of domains of automation was defined in 2006 to deal with the small set of preferences of two users, mainly to personalize the switch, lights and access of the environment. In 2007 the domains of automation had grown to include making coffee and a meta-agent structure that personalizes the behavior of the switch depending on who is pressing it. 2008 shows a significant increase in the number of people in the environment which, in combination with a set of sporting events (such as the Eurocup) and new ongoing projects, led to an increase in the number of agents (more domains of automation) and rules (more preferences to code) in the environment. Finally, in 2009 the number of agents decreased again, as the preferences related to some of the 2008 events no longer applied. In addition, the number of rules of many agents was reduced as a result of users beginning to use wildcards (generality) to express personalized behaviors shared by more than one user. The evolution over these four years shows a self-regulating process in which the users' programming effort develops progressively, as experience is gained and as the complexity of the scenario requires it, balancing control and accessibility to exploit the resources of the environment according to the users' needs.
Fig. 6. Evolution of the number of agents, rules, rules per agent and people programming them in the Universidad Autónoma de Madrid's Ambient Intelligence laboratory from 2006 to 2009. The number of agents and rules grows as new domains of automation and preferences are tackled. Nevertheless, 2009 shows a significant decrease in the number of rules, since the inhabitants began using wildcards to group several preferences under a general one. Complex concepts, if correctly designed, are acquired through the use of simpler ones (not through training) and used when really needed, as in this case, where several people with similar preferences had to be dealt with.
6 Conclusions
As technology evolves and spreads throughout our lives, personal environments present a significant challenge. As organic spaces, they lack the thoughtful design and specific goals of professional environments, their elements are heterogeneous and varied, and their population presents a wide variety of needs and skills. Focusing on programming as a fundamental problem in reconfiguring and adapting technologies to the users' needs, we have presented Exploitational Interaction as a new way of designing interaction systems that balance control and accessibility according to the user's needs and skills.

Exploitational Interaction requires a new way of designing programming languages and structures in which the user's skills and understanding are the main driving forces. Thus, we have used simplification, modularization and reutilization as tools to provide a low initial threshold, isolate complexity and ease the learning curve, both in the language and in the programming structure. We have presented an event-free representation of the world and a simplified Trigger-Condition-Action rule structure as the starting point for programming, showing how triggers and conditions can be naturally distinguished, although their identification is hampered by the ambiguity of Spanish and English natural language. This phenomenon supports our simplicity claims but opens a door for clarification to
interface designers. The base language has thus been designed to prevent complex or less common concepts from imposing their complexity on simple designs. Complex concepts are then introduced in their simplest form as extensions of the base language and designed to gain complexity in a recursive process of introducing new simple concepts or reusing already known ones. The complexity that can be achieved with such systems has been compared with that of event algebras and consumption policies for event composition in professional contexts, allowing users not only to match their capabilities but also to create flexible structures, such as context-dependent composite events or mixed consumption policies, more naturally. The importance of event composition for personal environments has been shown in an application designed to allow caregivers with no programming background to program assistive reminders for people with special needs.

Finally, considering personal environments as multi-domain and multi-person spaces, we have argued for the necessity of providing a free-choice mechanism for modularization and the means necessary for users to transfer their natural hierarchies to the intelligent environment. Simplification, modularization and reutilization have been used again to design the programming structure, presenting the agent as the initial structure. Agents have been designed as a free-choice classification structure allowing users to organize their preferences while reducing the computational and reasoning costs of the overall system by splitting it into independent modules. Using agents as another part of the context allowed us to reuse the programming language to create coordination structures, providing the necessary means for users to program their own hierarchies according to the same principles established for the language design. Since the hierarchies governing each social organization are entangled with the complexity of that social organization, the problem of allowing different types of hierarchies has been reformulated as one of allowing different types of complexity. Bar-Yam's concepts of interdependence and scale [39] have been proposed to define the complexity of such solutions, using as testing experiences the three environments in which the system has been deployed. A final experiment has been presented in which the evolution of agents and rules in an environment over four years can be seen, showing a self-regulating process in which the users' programming effort develops progressively as experience is gained and as the complexity of the scenario requires it.
References

1. Mozer, M.M.: The neural network house: An environment that adapts to its inhabitants. In: Proceedings of the AAAI Spring Symposium on Intelligent Environments. AAAI Press, Menlo Park (1998)
2. Lesser, V., Atighetchi, M., Benyo, B., Horling, B., Raja, A., Vincent, R., Wagner, T., Xuan, P., Zhang, S.X.: The UMASS intelligent home project. In: Etzioni, O., Müller, J.P., Bradshaw, J.M. (eds.) Proceedings of the Third Annual Conference on Autonomous Agents (AGENTS 1999), May 1-5, pp. 291–298. ACM Press, New York (1999)
3. Youngblood, G.M., Cook, D.J., Holder, L.B.: Managing adaptive versatile environments. Pervasive and Mobile Computing 1(4), 373–403 (2005)
4. Rashidi, P., Cook, D.: Keeping the resident in the loop: Adapting the smart home to the user. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans 39(5), 949–959 (2009)
5. Brdiczka, O., Reignier, P., Crowley, J.L.: Supervised learning of an abstract context model for an intelligent environment, smart objects and ambient intelligence. In: SOC-EUSAI 2005, Grenoble (2005)
6. Dey, A.K., Hamid, R., Beckmann, C., Li, I., Hsu, D.: a CAPpella: programming by demonstration of context-aware applications. In: Proceedings of ACM CHI 2004 Conference on Human Factors in Computing Systems, vol. 1, pp. 33–40 (2004)
7. Newman, M.W., Sedivy, J.Z., Neuwirth, C., Edwards, W.K., Hong, J.I., Izadi, S., Marcelo, K., Smith, T.F.: Designing for serendipity: supporting end-user configuration of ubiquitous computing environments. In: Symposium on Designing Interactive Systems, pp. 147–156 (2002)
8. Taylor, A.: Intelligence in Context. In: International Symposium on Intelligent Environments, Cambridge (United Kingdom), Microsoft Research, April 5-7, pp. 35–44 (2006)
9. Román, M., Hess, C.K., Cerqueira, R., Ranganathan, A., Campbell, R.H., Nahrstedt, K.: Gaia: A middleware infrastructure to enable active spaces. IEEE Pervasive Computing, 74–83 (October-December 2002)
10. Ballesteros, F.J., Soriano, E., Muzquiz, G.G., Algara, K.L.: Plan B: Using files instead of middleware abstractions. IEEE Pervasive Computing 6(3), 58–65 (2007)
11. Helal, S., Mann, W., El-Zabadani, H., Kaddoura, Y., Jansen, E.: The gator tech smart house: A programmable pervasive space. IEEE Computer 38(3), 50–60 (2005)
12. Bardram, J.: The java context awareness framework (JCAF) - a service infrastructure and programming framework for context-aware applications. Pervasive Computing, 98–115 (2005)
13. Dey, A.K., Salber, D., Abowd, G.D.: A conceptual framework and a toolkit for supporting the rapid prototyping of context-aware applications. Human-Computer Interaction (HCI) Journal 16(2-4), 97–166 (2001)
14. Minsky, M.: The Emotion Machine. Simon & Schuster, New York (2006)
15. Myers, B.A., Pane, J.F., Ko, A.: Natural programming languages and environments. Commun. ACM 47(9), 47–52 (2004)
16. Hague, R.: End-user programming in multiple languages. Technical report UCAM-CL-TR-651, PhD thesis, University of Cambridge, Computer Laboratory (October 2005)
17. Rodden, T., Crabtree, A., Hemmings, T., Koleva, B., Humble, J., Åkesson, K.P., Hansson, P.: Between the dazzle of a new building and its eventual corpse: assembling the ubiquitous home. In: Benyon, D., Moody, P., Gruen, D., McAra-McWilliam, I. (eds.) Conference on Designing Interactive Systems, pp. 71–80. ACM, New York (2004)
18. Mavrommati, I., Kameas, A., Markopoulos, P.: An editing tool that manages device associations in an in-home environment. Personal and Ubiquitous Computing 8(3-4), 255–263 (2004)
19. Want, R., Schilit, B., Adams, N., Gold, R., Petersen, K., Goldberg, D., Ellis, J., Weiser, M.: An overview of the ParcTab ubiquitous computing experiment. IEEE Personal Communications 2(6), 28–43 (1995)
20. Schmidt, A.: Implicit human computer interaction through context. Personal and Ubiquitous Computing 4(2/3) (2000)
21. Augusto, J.C., Nugent, C.D.: The use of temporal reasoning and management of complex events in smart homes. In: de Mántaras, R.L., Saitta, L. (eds.) ECAI, pp. 778–782. IOS Press, Amsterdam (2004)
22. Bischoff, U., Kortuem, G.: Rulecaster: A programming system for wireless sensor networks. In: Havinga, P.J.M., Lijding, M.E., Meratnia, N., Wegdam, M. (eds.) EuroSSC 2006. LNCS, vol. 4272, pp. 262–263. Springer, Heidelberg (2006)
23. Kulkarni, A.: A reactive behavioral system for the intelligent room. PhD thesis, Massachusetts Institute of Technology (2002)
24. Wang, X.H., Zhang, D.Q., Gu, T., Pung, H.K.: Ontology based context modeling and reasoning using OWL. In: Proceedings of PerCom 2004, Orlando, FL, USA, pp. 18–22 (March 2004)
25. Nieto, I., Botía, J., Gómez-Skarmeta, A.: Information and hybrid architecture model of the OCP contextual information management system. 12(3), 357–366 (2006)
26. Cheverst, K., Byun, H., Fitton, D., Sas, C., Kray, C., Villar, N.: Exploring issues of user model transparency and proactive behaviour in an office environment control system. User Modeling and User-Adapted Interaction 15(3), 235–273 (2005)
27. Gal, C.L., Martin, J., Lux, A., Crowley, J.L.: SmartOffice: Design of an intelligent environment. IEEE Intelligent Systems 16(4), 60–66 (2001)
28. Papert, S.: Mindstorms: Children, Computers, and Powerful Ideas. Basic Books, New York (1980)
29. Dey, A., Sohn, T., Streng, S., Kodama, J.: iCAP: Interactive prototyping of context-aware applications. In: Fishkin, K.P., Schiele, B., Nixon, P., Quigley, A. (eds.) PERVASIVE 2006. LNCS, vol. 3968, pp. 254–271. Springer, Heidelberg (2006)
30. Haya, P.A.: Tratamiento de Información Contextual en Entornos Inteligentes. PhD thesis, Universidad Autónoma de Madrid (2006)
31. Haya, P.A., Esquivel, A., Montoro, G., García-Herranz, M., Alamán, X., Hervás, R., Bravo, J.: A prototype of context awareness architecture for ambience intelligence at home. In: International Symposium on Intelligent Environments, Cambridge, United Kingdom, Microsoft Research, pp. 49–55 (2006)
32. Paton, N.W., Diaz, O.: Active database systems. ACM Computing Surveys 31(1), 63–103 (1999)
33. Paschke, A.: The Reaction RuleML classification of the event / action / state processing and reasoning space. Technical report (November 10, 2006)
34. Rafatirad, S., Gupta, A., Jain, R.: Event composition operators: ECO. In: Proceedings of the 1st ACM International Workshop on Events in Multimedia, EiMM 2009, pp. 65–72. ACM, New York (2009)
35. Chakravarthy, S., Krishnaprasad, V., Anwar, E., Kim, S.K.: Composite events for active databases: Semantics, contexts and detection. In: Proceedings of the Twentieth International Conference on Very Large Databases, Santiago, Chile, pp. 606–617 (1994)
36. Kakas, A.C., Moraitis, P.: Argumentation based decision making for autonomous agents. In: AAMAS, pp. 883–890. ACM, New York (2003)
37. Maslow, A.H.: Motivation and Personality. Harper, New York (1954)
38. Haya, P.A., Montoro, G., Esquivel, A., García-Herranz, M., Alamán, X.: A mechanism for solving conflicts in ambient intelligent environments. Journal of Universal Computer Science 12(3), 284–296 (2006)
39. Bar-Yam, Y.: Analyzing the effectiveness of social organizations using a quantitative scientific understanding of complexity and scale. NECSI Technical Report (May 2007)
40. García-Herranz, M., Haya, P.A., Alamán, X., Martín, P.: Easing the smart home: augmenting devices and defining scenarios. In: 2nd International Symposium on Ubiquitous Computing & Ambient Intelligence 2007 (2007) (Best paper award)
A Middleware for Implicit Interaction M.J. O’Grady, J. Ye, G.M.P. O’Hare, S. Dobson, R. Tynan, R. Collier, and C. Muldoon CLARITY: Centre for Sensor Web Technologies, School of Computer Science & Informatics, University College Dublin, Belfield, Dublin 4, Ireland {michael.j.ogrady,juan.ye,gregory.ohare,simon.dobson, richard.tynan,rem.collier,conor.muldoon}@ucd.ie
Abstract. Achieving intuitive and seamless interaction with computational artifacts remains a cherished objective for HCI professionals. Many have a vested interest in the achievement of this objective as usability remains a formidable barrier to the acceptance of technology in many domains and by various groups within the general population. Indeed, the potential of computing in its diverse manifestations will not be realized fully until such time as communication between humans and computational objects can occur transparently and instinctively in all instances. One step towards achieving this is to harness the various cues that people normally use when communicating as such cues augment and enrich the communication act. Implicit interaction offers a model by which this may be understood and realized; however, implementing a solution that effectively harnesses implicit interaction is problematic. This chapter presents an intelligent middleware framework as a means for harnessing the disparate data sources necessary for capturing and interpreting implicit interaction events.
1 Introduction

It is acknowledged that intuitive interaction is fundamental to the success of computing services. How such interaction is achieved is open to question, and the need for an answer to this question is increasingly urgent given the ongoing paradigm shift towards pervasive computing. In the original manifesto for ubiquitous computing in the early 1990s, the need for seamless and intuitive interaction was explicitly acknowledged; how it was to be achieved was not stated. A decade later, the Ambient Intelligence initiative proposed that Intelligent User Interfaces (IUIs) would solve this problem. Another decade has passed and the question remains open. In this chapter, we propose that a holistic view of interaction be adopted, encompassing its explicit and implicit components. Realizing this in practice is computationally complex; nevertheless, developments in sensor and related technologies are enabling hardware and software platforms from which this vision of interaction may be attained in practice.
1.2 How People Interact

If Intelligent User Interfaces (IUIs) [1] that truly enable intuitive, instinctive interaction are to be developed, an innate understanding of how people communicate is essential, and it is useful to reflect on this briefly. Humans communicate using a variety of means, verbal being a prominent communication modality. Yet non-verbal cues (behavioral signals), for example frowning, have over four times the effect of verbal cues [2]. For interfaces to act intelligently and instinctively, non-verbal cues need to be incorporated into their design and implementation. One well-known classification of non-verbal behavior is that of Ekman & Friesen [3], who list five categories:

• Emblems: actions that carry meaning in and of themselves, e.g., a thumbs up;
• Illustrators: actions that help listeners better interpret what is being said, for example, finger pointing;
• Regulators: actions that help guide communication, for example head nods;
• Adaptors: actions that are rarely intended to communicate but that give a good indication of physiological and psychological state;
• Affect displays: actions that express emotion without the use of touch, for example, sadness, joy and so on.

Though this categorization has proved popular, it does not capture all kinesic behaviors, eye behavior being one key omission. Nevertheless, the model does give some inkling of the complexity of the problem that must be addressed if instinctive interaction between humans and machines is to be achieved. Capturing all kinesic behavior is desirable, as studies indicate that human judgments are more accurate when based on a combination of face, body and speech than when just using face and body [4], though the contribution of each may depend on the prevailing context.

In practice, it may not be feasible in all circumstances to capture all cues; thus it may be necessary to work with a subset of the available behavioral cues. This need not be an insurmountable problem, as some cues may be more important than others. Certainly the visual channel, through the reading of facial expressions and body postures, seems to be the most important, as the face represents the dominant means of demonstrating and interpreting affective state [5]. Though speech is essential for communication in normal circumstances, interpreting affective state from both its linguistic and paralinguistic elements is inherently difficult, and much work remains to be completed before the emotional significance of these elements can be extracted and identified with confidence [6].

Finally, the issue of physiological cues needs to be considered briefly, as these almost invariably indicate an affective state. Examples of such cues might include increased heart rate, temperature and so on. Unless they have undergone specific training, most people are not capable of sensing such cues. Wearable computing offers options for harvesting them, as garments embedded with sensors for monitoring heart and respiratory rate, amongst others, are coming to market. However, whether people will wear such garments in the course of their normal everyday activities is open to question, as is the issue of whether they would share this data with a third party.
2 Interaction Modalities

As has been described, human communication encompasses many modalities, and significant research is still required if the underlying complexity is to be resolved. This poses significant challenges for those who aspire to develop computing systems that can successfully capture and interpret the various cues and subtleties inherent in human communication. As a first step towards this, many researchers have focused on multimodal human-computer interaction, seeing it as a promising approach to facilitating a more sophisticated interaction experience.

2.1 Multimodal Interaction

Multimodal interaction in an HCI context harnesses a number of different communication channels or modalities to enable interaction with a computer, either in terms of input or output. The key idea motivating multimodal interaction is that it would appear to be a more natural method of communication, as it potentially allows the capturing and interpretation of a more complete interaction context, rather than just one aspect of it. A key question then arises: does research support or contradict this motivation? One well-cited study by Oviatt [7] systematically evaluated multimodal interfaces. It was shown that multimodal interfaces sped up task completion by 10%, and that users made 36% fewer errors than with a unimodal interface. Furthermore, over 95% of subjects declared a preference for multimodal interaction. Though these results are impressive, it must be remembered that this study focused on interaction in one domain, namely that of map-based systems. Whether these results can be generalized to other domains and different combinations of modalities remains to be seen. To gain a deeper understanding of this issue, it is useful to reflect on how multimodal interaction is defined. Sebe [8] defines modality as being "a mode of communication according to the human senses and computer input devices activated by humans and measuring human qualities". For each of the five human senses, a computer equivalent can be found. Microphones (hearing) and cameras (sight) are well established. However, haptics (touch) [9], olfaction (smell) [10] and taste [11] may also be harnessed, albeit usually in specific domains and circumstances. For a system to be considered multimodal, these channels must be combined. For example, a system that recognizes emotion and gestures using one or multiple cameras would not be multimodal, according to this definition, whereas one that used a mouse and keyboard would be. This definition is strongly influenced by the word "input". A less rigorous interpretation might be to consider not only modality combinations but also select attributes of individual modalities. In either case, the problem facing the software engineer is identical: individual input modalities must be parsed and interpreted, and then the final meaning of the interaction estimated through a fusion process. How this may be achieved is beyond the scope of this discussion; however, one proposed technique involves the use of weighted finite-state devices, an approach that is lightweight from a computational perspective and is thus suitable for deployment on a range of mobile and embedded devices [12]. Ultimately, the question being posed for an arbitrary system is: what is it that the user intends (Fig. 1)?
Fig. 1. At any time, a user may display a range of non-verbal cues that give a deeper meaning to an arbitrary interaction. Identifying and interpreting such cues is computationally complex but a prerequisite to instinctive interaction.
While this may never be known with 100% certainty, the harnessing of select cues that are invariably used in communication may contribute to a more complete understanding of the user's intent, thereby leading to a more satisfactory interactive experience.

2.2 Interaction as Intent

In most encounters with computing, interaction is explicit: an action is undertaken with the expectation of a certain response. In computational parlance, it is event driven. The reset button is pressed and the workstation reboots. This is the default interaction modality that everyone is familiar with, even in non-computing scenarios. When designing interfaces, a set of widgets is available that operates on this principle. No other issue is considered. The application is indifferent to emotions and other contextual parameters. Should such context be available, a number of options open up, but the appropriate course of action may not be obvious in all circumstances. If it is determined that the user is stressed, for example, is the designer justified in restricting what they can do? Should certain functionality be temporarily suspended while certain emotions are dominant? If applications are to act instinctively, the answer is probably yes; thus embedded applications will have to support multimodal I/O. Interaction may also be implicit, and it is here that non-verbal cues may be found. All people communicate implicitly. The tone of people's voices, an arched eyebrow and other facial expressions reinforce what they say verbally. Intriguingly, they can also contradict it. Though people can seek to deceive with words, gestures can indicate when they do so. Thus, if we seek interactions that are based on truth, an outstanding challenge is to harness and interpret implicit interaction cues. This is computationally complex, requiring that such cues be captured, interpreted and reconciled in parallel with explicit interaction events. One subtle point with implicit interaction is that it can, in certain circumstances, represent the direct opposite of explicit interaction. In short, what is NOT done, as
opposed to what is done, may indicate a choice or preference. For example, in ignoring an available option, users may be saying something important about their preferences. What this means is, of course, domain and context dependent. In all but the simplest cases, implicit interaction is multimodal. It may require the parallel capture of distinct modalities, for example, audio and gesture. Or it may require that one modality be captured but be interpreted from a number of perspectives; for example, in the case of the audio modality, semantic meaning and emotional characteristics may be extracted in an effort to develop a deeper understanding of the interaction. Implicit interaction was first defined by Schmidt [13], who believed that ongoing developments in computing would ultimately result in implicit interaction becoming a viable alternative to the traditional explicit modality. Furthermore, it was observed that implicit interaction is tightly coupled with the prevailing context at the time of the interaction. Despite the intervening time period, significant difficulties remain. This is particularly true when implicit interaction is considered in terms of human behavior analysis. Indeed, one promising approach in this area is that of Social Signal Processing [14]. However, for the purposes of this discussion, the focus will remain on the issue of situational context, and the potential of an intelligent distributed middleware approach will be discussed.

2.3 Models of Interaction

Various models of interaction have been proposed in computational contexts, for example, those of Norman [15] and Beale [16]. Ultimately, all frameworks coalesce around the notions of input and output, though the human and computer interpretations of each are not symmetrical. Obrenovic and Starcevic [17] define input modalities as being either stream-based or event-based. In the latter case, discrete events are produced in direct response to user actions, for example, clicking a mouse. In the former case, a time-stamped array of values is produced. Output modalities are classified as either static or dynamic according to the data presented to the user. Static responses would usually be presented in modal dialog boxes; dynamic output may present as an animation, something that must be interpreted only after a time interval has elapsed.
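The event-based versus stream-based distinction can be made concrete with a small sketch. The following is a hypothetical Java fragment (the type names are invented for this example and are not drawn from [17]):

```java
import java.util.List;

// Hypothetical sketch of the event-based vs. stream-based distinction drawn
// by Obrenovic and Starcevic [17]; the type names are illustrative only.
interface InputModality {
    String name();
}

// Event-based: discrete events produced in direct response to user actions.
record DiscreteEvent(String name, long timestampMillis, String action) implements InputModality {}

// Stream-based: a time-stamped array of values, e.g., accelerometer samples.
record SampleStream(String name, long startMillis, long sampleIntervalMillis,
                    List<Double> values) implements InputModality {}

public class ModalityDemo {
    public static void main(String[] args) {
        InputModality click = new DiscreteEvent("mouse", System.currentTimeMillis(), "click");
        InputModality accel = new SampleStream("accelerometer",
                System.currentTimeMillis(), 20, List.of(0.01, 0.02, 0.98, 1.10));
        System.out.println(click + "\n" + accel);
    }
}
```

The distinction matters for implicit interaction, since behavioral cues typically arrive as streams that must be segmented and interpreted before any "event" can be said to have occurred.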
Fig. 2. Interaction may be regarded as constituting both an implicit and explicit component
For the purposes of this discussion, interaction is considered as comprising a spectrum that incorporates both an implicit and an explicit component (Fig. 2), but in which one dominates. An explicit interaction may be reinforced or augmented by an implicit one, for example, smiling while selecting a menu option. Likewise, an implicit interaction may be supported by an explicit one; an extreme case is somebody acting under duress, where they are doing something but their body language clearly states that they would rather not be pursuing that course of action. Thus the difficulty from a computational perspective is to identify the dominant and subordinate elements of an interaction, and to ascribe semantic meaning to them. A key determinant of this is the context in which the interaction occurs.
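To make the idea of dominant and subordinate components more tangible, the following hypothetical Java sketch combines explicit and implicit evidence for a set of candidate intents. The scores, weights and intent names are invented for this example; in practice they would be produced by a fusion process such as the weighted finite-state approach cited in Sect. 2.1 [12]:

```java
import java.util.Map;

// Hypothetical sketch: weighted combination of explicit and implicit evidence
// for candidate intents; the component with the larger weight dominates.
public class InteractionSpectrumDemo {

    static String fuse(Map<String, Double> explicitScores,
                       Map<String, Double> implicitScores,
                       double explicitWeight) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String intent : explicitScores.keySet()) {
            double score = explicitWeight * explicitScores.getOrDefault(intent, 0.0)
                         + (1 - explicitWeight) * implicitScores.getOrDefault(intent, 0.0);
            if (score > bestScore) { bestScore = score; best = intent; }
        }
        return best;
    }

    public static void main(String[] args) {
        // Explicit evidence: the user selected "delete file" from a menu.
        Map<String, Double> explicitEvidence = Map.of("delete_file", 0.9, "cancel", 0.1);
        // Implicit evidence: hesitation and a frown suggest reluctance.
        Map<String, Double> implicitEvidence = Map.of("delete_file", 0.2, "cancel", 0.8);

        // When the explicit component dominates, the selection stands; as the
        // implicit component gains weight, the interpretation shifts.
        System.out.println(fuse(explicitEvidence, implicitEvidence, 0.8)); // delete_file
        System.out.println(fuse(explicitEvidence, implicitEvidence, 0.4)); // cancel
    }
}
```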
3 Reasoning with Context

If the context in which the user operates is fully understood, a successful interaction can take place. In practice, an incomplete picture of the prevailing context is all that can realistically be expected in all but the simplest scenarios. Indeed, it is questionable whether it is possible to articulate all possible contextual elements for an arbitrary application [18]. Usually, software engineers will consider simpler forms of context, normally those that are easy to capture and interpret, and incorporate them into their designs. Though useful, these are just proxies for user intent [19] and are used in an effort to remedy the deficiency in understanding of what it is that the user is trying to achieve. This deficiency obliges the software engineer to use incomplete models to best estimate the prevailing context, and to adapt system behavior accordingly.

3.1 Context Reasoning

Not every piece of information about a user has an equal effect on the interaction. Low-level context can be voluminous, trivial, vulnerable to small changes, and noisy. Therefore, higher-level contexts (or situations) need to be derived from a body of low-level context; these are more accurate, human-understandable, and interesting to applications. Earlier research on context attempted to use first-order logic to write reasoning rules, for example the work by Gu et al. [20], Henricksen et al. [21], and Chen et al. [22]. More recently, ontological reasoning mechanisms have been adopted as the technology of choice to make reasoning more powerful, expressive, and precise [23, 24]. Current research focuses more on formalizing situation abstraction in terms of logic programming. Loke presents a declarative approach to representing and reasoning with situations at a high level of abstraction [25]. A situation is characterized by imposing constraints on the output or readings returned by sensors (Fig. 3); the situation occurs when the constraints imposed on it are satisfied by the values returned by the sensors. For example, an "in_meeting_now" situation occurs when a person is located with more than two persons and there is an entry for a meeting in a diary. These constraints are represented as a logic program. The approach is based on the logic programming language LogicCAP, which embeds situation programs in Prolog and provides developers with a high-level means of programming and reasoning about situations.
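The "in_meeting_now" example can be paraphrased outside of LogicCAP. The following is a minimal Java sketch, not Loke's Prolog formulation, in which a situation is simply a predicate over current sensor readings; the type and field names are invented for this illustration:

```java
import java.util.function.Predicate;

// Hypothetical sketch: a situation expressed as constraints over sensor
// readings, loosely paraphrasing the "in_meeting_now" example (the real
// LogicCAP formulation is a Prolog logic program, not Java).
record SensorReadings(int peopleCoLocated, boolean meetingInDiary) {}

public class SituationDemo {
    // The situation occurs when all of its constraints are satisfied.
    static final Predicate<SensorReadings> IN_MEETING_NOW =
            r -> r.peopleCoLocated() > 2 && r.meetingInDiary();

    public static void main(String[] args) {
        SensorReadings now = new SensorReadings(4, true);
        System.out.println("in_meeting_now: " + IN_MEETING_NOW.test(now)); // true
    }
}
```

The appeal of the declarative formulation is precisely that such constraints remain separate from the code that acquires the sensor readings.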
Fig. 3. Situations can be inferred from individual contexts harnessed from a suite of sensors
The logical theory makes it amenable to formal analysis, and decouples the inference procedures for reasoning about context and situations from the procedure for acquiring sensor readings. This modularity and separation of concerns facilitates the development of context-aware systems.

3.2 Context Uncertainty

In terms of software, the error-prone nature of context and contextual reasoning alters the ways in which we must think about interaction and adaptation. If a context is incorrectly reported, or is considered irrelevant to users, a problem will occur when a system makes a responsive action adapting to real-time contextual changes [26]. Resolving uncertainty in context has been a hot research topic in recent years. Henricksen et al. [27] refine the quality of context into five categories:

1. incompleteness – a context is missing;
2. imprecision – the granularity of a context is too coarse;
3. conflict – a context is inconsistent with another context;
4. incorrectness – a context contradicts the real-world state;
5. out-of-dateness – a context is not updated in response to changes.
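These categories could be attached to context readings so that downstream reasoning can take data quality into account. The following fragment is a hypothetical Java sketch (the type and field names are invented for this example, not taken from [27]):

```java
import java.util.EnumSet;

// Hypothetical sketch: tagging a context reading with the quality issues
// identified by Henricksen et al. [27], so that later reasoning can weigh
// or discard it accordingly.
enum ContextQualityIssue {
    INCOMPLETE, IMPRECISE, CONFLICTING, INCORRECT, OUT_OF_DATE
}

record ContextReading(String source, Object value, long timestampMillis,
                      EnumSet<ContextQualityIssue> issues) {

    // Example check for out-of-dateness: the reading is stale if older than maxAgeMillis.
    boolean stale(long nowMillis, long maxAgeMillis) {
        return nowMillis - timestampMillis > maxAgeMillis;
    }
}
```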
Any decision may be made incorrectly on account of any of these types of poor input data quality. Beyond quality-of-context issues, oversimplified reasoning mechanisms can introduce extra noise into inferred results. Anagnostopoulos et al. [28] define a fuzzy function to evaluate the degree of membership in a situational involvement, that is, the degree of belief that a user is involved in a predicted situation. They define Fuzzy Inference Rules (FIR) that are used to deal with imprecise knowledge about situational context and the user
behaviour/reaction and historical context. Similarly, Ye et al. [29] use a fuzzy function to integrate and abstract multiple low-level contexts into high-level situations; the fuzzy function evaluates how well the current context satisfies the constraints in a situation's specification. Machine learning techniques are widely applied to deal with uncertainty in the inference process. Bayesian networks have a causal semantics that encodes the strength of causal relationships between lower- and higher-level contexts as probabilities. Bayesian networks have been applied by Ranganathan et al. [30], Gu et al. [20], Ding et al. [31], Truong et al. [32], Dargie [33], and Ye et al. [34]. For example, Gu et al. encoded probabilistic information in ontologies, converted the ontological model into a Bayesian network, and inferred higher-level contexts from the Bayesian network. Their work aimed to address the uncertainty caused by the limits of sensing technologies and the inaccuracy of derivation mechanisms. Bayesian networks are best suited to applications where there is no need to represent ignorance and prior probabilities are available [35]. Any decision may be made incorrectly on account of errors in input data, yet we cannot simply blame poor performance on poor input data quality: we must instead construct models that accommodate uncertainty and error across the software system, and allow low-impact recovery [36].

3.3 Intelligibility of Context Reasoning

Interaction can be more useful if it is scrutable or intelligible. Intelligibility is defined as "an application feature including supporting users in understanding, or developing correct mental models of what a system is doing, providing explanations of why the system is taking a particular action, and supporting users in predicting how the system might respond to a particular input" [37]. On the one hand, a system will make decisions by taking all input, whether explicit or implicit to users, from sensors embedded in an environment. It uses its knowledge base in reasoning, but it has limited ability to rule out random input or to understand which input is more important than another in determining actions. On the other hand, a user may have little understanding of what a system considers to be input and why a particular action is taken. Making a system intelligible will benefit both the system and its end-users. The system can provide a suitable interaction interface through which to explain its actions, while users can provide feedback through that interface so that the system can adjust its behavior and provide services that better match users' desires in the future.
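Returning to the fuzzy abstraction of Sect. 3.2, the following hypothetical Java sketch shows the general idea together with the kind of explanation Sect. 3.3 argues for. The membership functions and thresholds are invented for this illustration and are not taken from [28] or [29]:

```java
// Hypothetical sketch: each low-level context contributes a degree of
// membership, the minimum is taken as the degree to which the situation
// holds, and a short human-readable justification is produced.
public class FuzzySituationDemo {

    // Degree to which an occupancy count suggests "meeting" (ramps up from 1 to 3 people).
    static double occupancyMembership(int people) {
        if (people <= 1) return 0.0;
        if (people >= 3) return 1.0;
        return (people - 1) / 2.0;
    }

    // Degree to which ambient noise (dB) suggests conversation (ramps up from 40 to 60 dB).
    static double noiseMembership(double dB) {
        return Math.max(0.0, Math.min(1.0, (dB - 40.0) / 20.0));
    }

    public static void main(String[] args) {
        int people = 2;
        double noise = 55.0;
        double occ = occupancyMembership(people);
        double noi = noiseMembership(noise);
        double meeting = Math.min(occ, noi);   // conjunction of fuzzy constraints

        System.out.printf("degree(in_meeting) = %.2f%n", meeting);
        // A minimal account of how the inference was reached (intelligibility).
        System.out.printf("because occupancy(%d)=%.2f and noise(%.0f dB)=%.2f%n",
                          people, occ, noise, noi);
    }
}
```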
4 Embedded Agents

Embedded agents [38] offer an effective model for designing and implementing solutions that must capture and interpret context parameters from disparate and distributed sources. Such agents have been deployed in a variety of situations, including user interface implementation on mobile devices [39], realizing an intelligent dormitory for students [40], and realizing mobile information systems for the tourism [41] and mobile commerce domains [42] respectively.
4.1 The Agent Paradigm

Research in intelligent agents has been ongoing for over two decades now, and what it is that defines an agent remains open to debate. For the purposes of this discussion, agents are considered, somewhat simplistically perhaps, to be of one of two varieties: reactive and deliberative. The interested reader is referred to Wooldridge and Jennings [43] for a more systematic treatment of agents and agent architectures. Reactive agents respond to stimuli in their environment. An event occurs, the agent perceives it and responds using a predefined plan of action. Such agents are quite simple to design and implement. A prerequisite for their usage is that the key events or situations can be clearly defined and that an equivalent plan of action can be constructed for each circumstance. Such agents can be easily harnessed for explicit interaction modalities as their response time is quick. Deliberative agents reflect and reason before engaging in an action. Essential to their operation is a reasoning model; hence they may be demanding from a computational perspective and their response time may be unacceptable. Such agents maintain a model of both themselves and the environment they inhabit. Identifying changes in the environment enables them both to adapt to the new situation and to effect changes within the environment in certain circumstances. Such agents are useful for implicit interaction in that they enable transparent monitoring of an end-user, and their inherent reasoning ability allows them to come to a decision as to if and when an implicit interaction episode has occurred, and what the appropriate course of action might be.

4.2 Coordination and Collaboration

As has been discussed, all interaction takes place within a context, and an understanding of the prevalent context can usually make the meaning of the interaction itself clearer. However, gathering and interpreting select aspects of the prevalent context is a process fraught with difficulty, and it is in addressing this that the agent paradigm can be harnessed to most benefit. Agents are inherently distributed entities. Coordination and collaboration are of fundamental importance to their successful operation. To this end, all agents share a common language, or Agent Communication Language (ACL). Indeed, the necessity to support inter-agent communication has resulted in the development of an international ACL standard, which has been ratified by the Foundation for Intelligent Physical Agents (FIPA). FIPA has recently been subsumed into the IEEE Computer Society, forming an autonomous standards committee seeking to facilitate interoperability between agents and other non-agent technologies.

4.3 The Nature of the Embedded Agents

Embedded agents are lightweight agents that operate on devices with limited computational resources. Ongoing developments in computational hardware have resulted in agents becoming viable on resource-limited platforms such as mobile telephones and the nodes of Wireless Sensor Networks (WSNs). While such agents may be limited in what they can do on such platforms, it is important to remember that they can call upon other agents and resources if the physical communications medium supports an adequate QoS. Thus a multi-agent system may itself be composed of a heterogeneous
suite of agents – some significantly more powerful than others, but all collaborating to fulfill the task at hand. Such a model of an MAS is reflective of the diverse suite of hardware that is currently available and may be harnessed in diverse domains. As an example of how embedded agents might collaborate, consider the following scenario. While a user's interaction with an arbitrary software package is being observed, the user wipes their brow. This gesture, made subconsciously, is observed and identified. However, what does it mean in this context? It may indicate that the user is stressed, or it may be a cue that the ambient office temperature is too high. In the latter case, it would not be difficult to confer with an agent monitoring a temperature sensor to identify the current temperature and check whether it is an average figure or perhaps too high. If considered high, a request could be forwarded to the agent in charge of air-conditioning to reduce the ambient temperature. In the case where the user is stressed, and there may be other cues to affirm this, the situation is more complicated. Is the user stressed because of the software or hardware itself, because of the task they are trying to accomplish, or because of some other circumstance? And does it really matter? In some cases, it may be quite important to know if a user is stressed, especially if they are operating a vital piece of equipment, for example in a medical theatre or air traffic control context. What the equipment should do if it senses that its operator is under stress is debatable, and may even raise ethical issues. However, it can reasonably be conjectured that the team leader or project manager might find it useful to know that one of their team members could be having difficulty and that some active intervention, though precautionary, might be a prudent course of action. While agents encompass a suite of characteristics that make them an apt solution for identifying episodes of implicit interaction, a further level of abstraction would be desirable if implicit interaction is to become incorporated into mainstream computing. In the next section, we consider how this might be achieved, specifically through the middleware construct.
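The brow-wipe scenario can be sketched as a toy Java program; this is a hypothetical illustration of the collaboration described above, not FIPA ACL nor the API of any particular agent platform, and all names and thresholds are invented:

```java
// Hypothetical sketch: an agent that observes a brow-wipe gesture consults a
// temperature-sensing agent and, if the room is warm, asks an actuator agent
// to lower the temperature; otherwise the gesture is flagged as a possible
// stress cue requiring corroboration.
interface Agent {
    String name();
}

class TemperatureAgent implements Agent {
    public String name() { return "temperature-monitor"; }
    double currentCelsius() { return 27.5; }   // would come from a sensor in practice
}

class HvacAgent implements Agent {
    public String name() { return "air-conditioning"; }
    void requestSetpoint(double celsius) {
        System.out.println(name() + ": setpoint lowered to " + celsius + " C");
    }
}

class InteractionAgent implements Agent {
    public String name() { return "interaction-observer"; }

    void onGesture(String gesture, TemperatureAgent temp, HvacAgent hvac) {
        if (!"brow_wipe".equals(gesture)) return;
        if (temp.currentCelsius() > 25.0) {
            hvac.requestSetpoint(23.0);          // likely thermal discomfort
        } else {
            System.out.println(name() + ": possible stress cue - await further cues");
        }
    }
}

public class CollaborationDemo {
    public static void main(String[] args) {
        new InteractionAgent().onGesture("brow_wipe", new TemperatureAgent(), new HvacAgent());
    }
}
```

In a genuine multi-agent system the three roles would of course run on separate platforms and exchange ACL messages rather than direct method calls.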
5 Characteristics of a Middleware for Implicit Interaction

In the previous sections, the issue of reasoning with uncertain contexts was discussed. Likewise, the agent paradigm was considered, in light of its inherently distributed nature, as a means for capturing and interpreting contextual states. Except in the simplest cases, interaction cannot be divorced from the context in which it occurs. Thus the key challenge that must be addressed is how to incorporate implicit interaction into the conventional software development lifecycle.

Requirements Analysis
Requirements analysis is concerned with identifying what it is that either a new or a modified system is expected to do. Various techniques have been proposed for eliciting user requirements; in particular, the key stakeholders are identified and their needs specified. How issues relating to interaction may be addressed depends on the approach adopted. During interviews, there is scope for identifying how users perceive interaction occurring and for identifying opportunities to incorporate alternative interaction modalities, including implicit modalities. Given the time and budgetary constraints that a project may labor under, it may be questionable as to
what scope software engineers have to do this. Not only must they obtain a thorough understanding of what a proposed system must do, but they must also gain an in-depth understanding of the target user group, including their needs, backgrounds and expectations. Should the requirements phase of a project include rapid prototyping, a greater understanding of how potential users envisage interacting with the proposed system may be gleaned. In such circumstances, a mockup is constructed, allowing users to see clearly how the interaction is planned and enabling the system designers to ascertain the possibility and desirability of incorporating an implicit interaction component. Whether the average software engineer is the best person for this task is an open question. In principle, such a task would be undertaken by usability professionals; in practice, such people may not become involved in the project until the next stage, if indeed at all.

Design & Specification
The objective here is to provide a systematic description of what a system will do. Naturally, all elements of how interaction will occur need to be agreed at this stage. First of all, there needs to be agreement on whether the interaction requirements would be best served by harnessing implicit interaction or, indeed, other interaction modalities. Various factors will influence this decision; for example, will the user base accept what they might perceive as non-conventional interaction modalities? More importantly, there may be a trade-off between system performance or responsiveness and the interaction modality adopted. Such a trade-off needs to be quantified. The implications for project planning must also be considered. Though there may be an excellent usability case for supporting an arbitrary interaction modality, the time-scale, budget or deployment configuration may preclude its realization in the project.

Implementation
At this stage, all the key decisions have been made; it only remains for them to be implemented. From an implementation perspective, realizing implicit interaction is just another programming task that must be completed such that it adheres to the design. However, it must be stated that programmers and designers currently have little to aid them should a request to include implicit interaction be forthcoming. Thus the resultant solution, which may operate perfectly, is really an ad hoc solution. If such interaction is to be incorporated into mainstream software development, a prerequisite will be that this process is transparent and easy to manage. At present, that is not the case. How this deficiency may be remedied is considered next.

5.1 Making Implicit Interaction Mainstream

Conventional software development environments include a range of widgets with associated behaviors that programmers can use in their designs and implementations. Such widgets can be customized and adapted according to a range of policies and application-specific requirements. This is the prevalent approach adopted in mainstream computing, where the interaction modality is predominantly explicit. Thus the principles underpinning this approach are well understood, and codes of best practice have been identified. This is not the case with implicit interaction.
Once a decision has been made either to discard the classic WIMP interface or to augment such interfaces with additional modalities, the creativity and ingenuity of the programmer will be required to craft a solution. It is worth reiterating that the model of interaction being adopted will, at this stage of the software development process, have been agreed and its behaviors specified. Thus the programmer has the singular task of implementing the design without necessarily being concerned with the merits or otherwise of the interaction modality being used. Their problem is that the lack of widgets will oblige them to create new solutions – a creative endeavor perhaps, but one which may be costly in terms of time. Such a scenario may well be replicated in diverse projects; thus a key challenge is to develop a framework that allows software developers to incorporate implicit interaction seamlessly into their products.

5.2 Toward a Middleware for Implicit Interaction

Implicit interaction may be unimodal or multimodal. Though an implicit interaction "event" may be said to have occurred, the event-driven model adopted in conventional software systems is not appropriate in this circumstance, at least not without modification, as users are not directly interacting with the system but rather doing so indirectly through a variety of behavioral cues. Though such cues are initiated by the user, they will not be communicated directly to the software system. Rather, the software must act in a proactive manner to capture and interpret such cues, rather than simply react to user stimuli. Thus developing a suite of APIs for capturing and interpreting a range of implicit interaction behaviors, though attractive, is not an option. A more robust solution is called for, and to this end it is proposed that one based on the middleware concept offers an approach for enabling the seamless integration of implicit interaction into conventional computing. Middleware has conventionally been viewed as a service provision layer that sits above the OS and networking layers but below domain-specific applications [44]. Frequently seen as a framework for ensuring interoperability, the middleware construct has been adopted in a diverse range of applications and domains, for example smart phones [45] and wireless sensor networks [46], and it offers a useful mechanism for providing a higher level of abstraction than that offered by conventional APIs. In the context of this discussion, it is instructive to note that middleware has been harnessed in the HCI domain. For example, Yaici and Kondoz [47] describe a middleware for the generation of adaptive user interfaces on resource-constrained mobile computing devices. Likewise, Repo and Riekki [48] adopt a middleware approach for realizing context-aware multimodal user interfaces. Middleware thus offers an attractive framework for incorporating implicit interaction into mainstream computing. The framework itself may be implemented in a variety of ways; however, in light of the discussion on agents, it can be seen that agents encompass a suite of characteristics that make them a suitable basis for such a framework. Indeed, the framework could be designed such that it acts as a wrapper for a Multi-Agent System. In this way, a standardized interface to the middleware could be provided to software developers, while the developers themselves are shielded from the intricacies of both MAS development and the effort required to capture and classify instances of implicit interaction.
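The following hypothetical Java sketch illustrates the kind of developer-facing facade such a middleware might expose; the names and types are invented for this example and do not describe any existing system, including SIXTH:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: the application registers interest in implicit
// interaction episodes, while the capture and classification (e.g., by a
// multi-agent system underneath) stays hidden behind the middleware.
record ImplicitEpisode(String cue, String interpretation, double confidence) {}

public class ImplicitInteractionMiddleware {
    private final List<Consumer<ImplicitEpisode>> listeners = new ArrayList<>();

    public void onImplicitEpisode(Consumer<ImplicitEpisode> listener) {
        listeners.add(listener);
    }

    // In a real system this would be driven by the underlying agents and sensors.
    void publish(ImplicitEpisode episode) {
        listeners.forEach(l -> l.accept(episode));
    }

    public static void main(String[] args) {
        ImplicitInteractionMiddleware mw = new ImplicitInteractionMiddleware();
        mw.onImplicitEpisode(e ->
            System.out.println("Application notified: " + e.cue()
                + " -> " + e.interpretation() + " (" + e.confidence() + ")"));
        mw.publish(new ImplicitEpisode("frown", "user_dissatisfied", 0.7));
    }
}
```

The essential point is that the application sees only classified episodes, in much the same way that conventional toolkits deliver widget events, while the proactive capture of cues happens below the facade.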
5.3 Case Study: The SIXTH Middleware Architecture

An ongoing project in our laboratory concerns the design and development of a middleware for sensor networks. SIXTH [49] takes a broad interpretation of what a sensor might actually entail. At its simplest, a sensor network might comprise a network of nodes, for example motes. However, sensors can vary significantly in their capabilities, and might include a range of artifacts that at first sight appear to have little in common with the conventional view of what a sensor actually is. For example, a surveillance camera network is essentially a sensor network. Likewise, fabrics imbued with heart rate monitors and other physiological measuring instrumentation might comprise a sensor network.
Fig. 4. Constituent components of the SIXTH middleware architecture
SIXTH is motivated by two observations:
1. Practical sensor networks will be heterogeneous. This heterogeneity will be expressed in a number of ways. Specifically, a range of sensors, differing in capability and communications mechanisms and supporting a range of sensed modes, will form
networks that support a range of diverse applications. Only in specialized sensor applications, for example environmental applications, will homogeneous networks be the norm. In the case of implicit interaction, it can be seen that a network of cameras and audio receivers would be essential just to capture vocal cues, gestures and facial expressions.
2. Sensor networks must be usable. In essence, the functionality encapsulated in sensors must be abstracted in an intuitive fashion such that it can be harvested and used by a variety of service providers. Only in this way will sensor networks become incorporated into mainstream computing applications and services.
Thus SIXTH aims to encapsulate the following characteristics:
− scalability;
− reusability;
− flexibility;
− openness;
− extensibility;
− modularity.
Figure 4 illustrates the key components of the SIXTH architecture. It comprises three core layers:
1. Adaptor Layer: This layer contains device-specific adaptors that utilize the native resources of the individual sensor itself and expose them to the higher layers of the middleware.
2. API Layer: This layer implements a set of device-agnostic APIs that can be used (in principle) to interface with any deployed sensor device. It provides support for addressing, (re-)programming of sensors, discovery of devices, monitoring of devices, and data access.
3. Service Layer: This layer augments the basic functionality provided by the API Layer to deliver higher-level services that are tailored to the specific applications that require access to the underlying devices.
Layers 1 and 2 are designed to address the issue of heterogeneity. Layer 3 provides a mechanism for integrating new services, enabling their transparent and intuitive use in a range of applications. At each layer, the components can be reused in many contexts to deliver multiple applications without the requirement for redevelopment of lower-level functionality. The interface between the Adaptor and API Layers has been designed to embrace a multiplicity of abstractions that facilitate diverse modes of interaction with the embedded devices. Finally, SIXTH supports embedded intelligence, that is, support for in-situ reasoning via the deployment of intelligent agents. Any agent platform that can operate on a Java 2 Micro Edition (Java ME) platform, for example Agent Factory Micro Edition (AFME) [50], will work with SIXTH.
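The layering pattern described above can be sketched in a few lines of Java. This is emphatically not the SIXTH API; it is a hypothetical illustration, with invented names, of how device-specific adaptors can sit beneath a device-agnostic interface that services and applications program against:

```java
import java.util.Map;

// Hypothetical sketch of the layering pattern: device-specific adaptors
// beneath a device-agnostic API, on top of which services would be built.
interface SensorAdaptor {                    // Adaptor Layer: device specific
    String deviceId();
    Map<String, Object> read();              // raw readings from the native device
}

interface SensorService {                    // API Layer: device agnostic
    Map<String, Object> latestReading(String deviceId);
}

class SimpleSensorService implements SensorService {
    private final Map<String, SensorAdaptor> adaptors;
    SimpleSensorService(Map<String, SensorAdaptor> adaptors) { this.adaptors = adaptors; }
    public Map<String, Object> latestReading(String deviceId) {
        return adaptors.get(deviceId).read();
    }
}

public class LayeredMiddlewareDemo {
    public static void main(String[] args) {
        SensorAdaptor mote = new SensorAdaptor() {   // stand-in for a real device adaptor
            public String deviceId() { return "mote-01"; }
            public Map<String, Object> read() { return Map.of("temperature", 21.4); }
        };
        SensorService api = new SimpleSensorService(Map.of(mote.deviceId(), mote));
        System.out.println(api.latestReading("mote-01"));   // a Service Layer would build on this
    }
}
```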
5.4 Application Domain: Ambient Assisted Living

Ambient Assisted Living (AAL) [51] is concerned with the provision of technological assistance to people as they grow older. It was conceived in response to the growing realization that demographic trends in many countries will result in national populations being dominated by elderly people, which will have significant sociological and economic implications for future societies. Though the AAL concept encapsulates all aspects of daily life, it is the home environment that is of most interest at present, and this is seen as the domain where it may prove most beneficial. Specifically, AAL is seen as a means of enabling the elderly to live independently for longer than would otherwise be the case, with all the benefits that accrue from this. From an interaction perspective, however, there is one key problem: older people tend to find interacting with technology difficult; for example, the common remote control may be perceived as being complex [52]. It must be stressed that this is primarily a usability issue, and that older people are not averse to technologies or assistive technologies per se. We conjecture that the implicit interaction modality may offer a promising approach to addressing this problem. As an initial step toward validation, an AAL configuration was deployed in our laboratory. The objective was to investigate to what degree the SIXTH middleware would succeed in capturing various elements of the prevailing context such that a more thorough interpretation of any interaction might be obtained. A heterogeneous suite of sensors was deployed, including a range of motion and temperature sensors, accelerometers and pressure mats. These were attached to a range of objects in the environment. SIXTH was harnessed for programming and configuring the sensor network, and all data was routed to a standard database. Initial results demonstrated that SIXTH could indeed harvest data from the various sensors and make it available for contextual analysis as originally envisaged. Though feasible, a number of deficiencies were identified that would hinder AAL systems in practice. In the first instance, only the SunSPOT platform [53] was capable of supporting agents; thus the power of agents for in-situ decision-making was not fully exploited. It is envisaged that the next generation of sensor platforms will prove sufficiently sophisticated to harness the full power of the agent paradigm. In the second instance, it was observed that many commercial platforms were for the most part proprietary and could not be seamlessly integrated into the AAL configuration by wrapping the sensor functionality within an individual agent as originally envisaged. This problem can only be addressed through a standardization initiative. Finally, the omnipresent problem of power management remains. Some sensor platforms, even when behaving in a simple stimulus/response manner, did not have sufficient battery power to operate satisfactorily for even a week. In summary, while the potential of sensor technologies for capturing context and interpreting interaction events, in both their explicit and implicit modalities, is significant, it remains quite some way from fulfillment. In response to these observations, it should be stated that a new generation of sensors will improve performance and power efficiency, possibly by an order of magnitude. The models of abstraction supported by SIXTH for enabling extensibility are currently being revised.
Finally, a more sophisticated sensor platform based on the Tyndall platform [54] is being investigated.
Supporting a wide range of sensors, this platform will enable further research in the broad area of context capture and agent-based collaborative analysis.
6 Conclusion

As computing technologies permeate more areas of everyday life, the need for a range of interaction modalities will become increasingly urgent, particularly if the promise of seamless and intuitive interaction is to become a reality rather than the aspiration it is at present. This chapter explored the concept of implicit interaction, explaining its genesis and reflecting on how it might be incorporated into mainstream computing. Further basic research is needed into understanding what it is that defines implicit interaction. As a start, it may be feasible to develop a classification of non-verbal cues that people normally use and attempt to attach semantic meaning to them. A cultural perspective on these needs to be maintained as well. Furthermore, the computational effort that must be expended in capturing implicit interaction needs to be quantified, particularly if a range of embedded artifacts is used for interaction capture. Likewise, the time expended both in capturing and in interpreting must be quantified so that an adequate response time can be estimated, thus ensuring that the quality of the user experience is maintained. Only when a more thorough understanding of the underlying principles is obtained can the practical issues of service implementation be considered.

Acknowledgements. This work is supported by Science Foundation Ireland (SFI) under grant 07/CE/I1147.
References

1. Maybury, M.T., Wahlster, W. (eds.): Readings in Intelligent User Interfaces. Morgan Kaufmann Publishers Inc., San Francisco (1998) 2. Argyle, M., Slater, V., Nicholson, H., Williams, M., Burgess, P.: The Communication of Inferior and Superior Attitudes by Verbal and Non-verbal Signals. British Journal of Social and Clinical Psychology 9, 221–231 (1970) 3. Ekman, P., Friesen, W.V.: The Repertoire of Nonverbal Behavior: Categories, Origins, Usage, and Coding. Semiotica 1, 49–97 (1969) 4. Ambady, N., Rosenthal, R.: Thin Slices of Expressive Behavior as Predictors of Interpersonal Consequences: A Meta-analysis. Psychological Bulletin 111(2), 256–274 (1992) 5. Ekman, P., Rosenberg, E.L. (eds.): What the Face Reveals: Basic and Applied Studies of Spontaneous Expression using the FACS. Oxford University Press, Oxford (2005) 6. Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., Taylor, J.: Emotion Recognition in Human-Computer Interaction. IEEE Signal Processing Magazine 18(1), 32–80 (2001) 7. Oviatt, S.: Multimodal Interactive Maps: Designing for Human Performance. Human-Computer Interaction 12(1), 93–129 (1997) 8. Sebe, N.: Multimodal Interfaces: Challenges and Perspectives. Journal of Ambient Intelligence and Smart Environments 1(1), 23–30 (2009)
9. Kuchenbecker, K.J., Fiene, J., Niemeyer, G.: Improving Contact Realism through Eventbased Haptic Feedback. IEEE Transactions on Visualization and Computer Graphics 12(2), 219–230 (2006) 10. Charumporn, B., Omatu, S.: Classifying Smokes using an Electronic Nose and Neural Networks. In: Proceedings of the 41st SICE Annual Conference, vol. 5, pp. 2661–2665 (2002) 11. Ciosek, P., Wróblewski, W.: Sensor Arrays for Liquid Sensing – Electronic Tongue Systems. Analyst 132, 963–978 (2007) 12. Johnston, M., Bangalore, S.: Finite-state Multimodal Integration and Understanding. Journal of Natural Language Engineering 11(2), 159–187 (2005) 13. Schmidt, A.: Implicit Human Computer Interaction through Context. In: Personal Technologies, vol. 4(2&3), pp. 191–199. Springer, Heidelberg (2000) 14. Vinciarelli, A., Pantic, M., Bourlard, H.: Social signal processing: Survey of an emerging domain. Image and Vision Computing 27(12), 1743–1759 (2009) 15. Norman, D.A.: The Design of Everyday Things. Doubleday (1989) 16. Dix, A., Finley, J., Abowd, G., Beale, R.: Human–Computer Interaction, 3rd edn. PrenticeHall, Englewood Cliffs (2004) 17. Obrenovic, Z., Starcevic, D.: Modeling Multimodal Human–Computer Interaction. IEEE Computer 37(9), 65–72 (2004) 18. Greenberg, S.: Context as a Dynamic Construct. Human-Computer Interaction 16(2–4), 257–268 (2001) 19. Dey, A.: Modeling and Intelligibility in Ambient Environments. Journal of Ambient Intelligence and smart environments 1(1), 57–62 (2009) 20. Gu, T., Pung, H.K., Zhang, D.Q.: A Bayesian Approach for Dealing with Uncertain Contexts. In: Proceedings of Advances in Pervasive Computing, pp. 205–210 (2004) 21. Henricksen, K., Indulska, J.: Developing Context-aware Pervasive Computing Applications: Models and Approach. Pervasive and Mobile Computing 2(1), 37–64 (2006) 22. Chen, H., Finin, T., Joshi, A.: An Ontology for Context-Aware Pervasive Computing Environments. Knowledge Engineering Review 18(3), 197–207 (2004) 23. Ranganathan, A., Campbell, R.: An Infrastructure for Context-awareness Based on First Order Logic. Personal Ubiquitous Computing 7(6), 353–364 (2003) 24. Ye, J., Coyle, L., Dobson, S., Nixon, P.: Ontology-based Models in Pervasive Computing Systems. Knowledge Engineering Review 22(4), 315–347 (2007) 25. Loke, S.W.: Representing and Reasoning with Situations for Context-aware Pervasive Computing: A Logic Programming Perspective. Knowledge Engineering Review 19(3), 213–233 (2004) 26. Schilit, B., Adams, N., Want, R.: Context-Aware Computing Applications. In: Workshop on Mobile Computing Systems and Applications, pp. 85–90. IEEE, New York (1994) 27. Henricksen, K., Indulska, J.: Modelling and Using Imperfect Context Information, pp. 33–37. IEEE, New York (2004) 28. Anagnostopoulos, C.B., Ntarladimas, Y., Hadjiefthymiades, S.: Situational Computing: An Innovative Architecture with Imprecise Reasoning. System and Software 80(12), 1993–2014 (2007) 29. Ye, J., McKeerver, S., Coyle, L., Neely, S., Dobson, S.: Resolving Uncertainty in Context Integration and Abstraction. In: Proceedings of the International Conference on Pervasive Services, pp. 131–140. ACM, New York (2008) 30. Ranganathan, A., Al-Muhtadi, J., Campbell, R.: Reasoning about Uncertain Contexts in Pervasive Computing Environments. IEEE Pervasive Computing 3(2), 1268–1536 (2004) 31. Ding, Z., Peng, Y.: A Probabilistic Extension to Ontology Language OWL. In: Proceedings of the 37th Hawaii International Conference on System Sciences, pp. 1–10 (2004) 32. 
Truong, B.A., Lee, Y.-K., Lee, S.-Y.: Modeling Uncertainty in Context-Aware Computing. In: Proceedings of the Fourth Annual ACIS International Conference on Computer and Information Science (ICIS 2005), pp. 676–681 (2005)
33. Dargie, W.: The Role of Probabilistic Schemes in Multisensor Context-Awareness. In: Proceedings of Fifth Annual IEEE International Conference on Pervasive Computing and Communications Workshops, pp. 27–32 (2007) 34. Ye, J., Coyle, L., Dobson, S., Nixon, P.: Using Situation Lattices to Model and Reason about Context. In: Proceedings of the Workshop on Modeling and Reasoning Context, pp. 1–12 (2007) 35. Hoffman, J.C., Murphy, R.R.: Comparison of Bayesian and Dempster-Shafer theory for sensing: a practitioner’s approach. In: Proceedings of Neural and Stochastic Methods in Image and Signal Processing II, pp. 266–279 (1993) 36. Ye, J., Dobson, S., Nixon, P.: An Overview of Pervasive Computing Systems. In: Augmented Materials and Smart Objects: Building Ambient Intelligence through Microsystems Technology, pp. 3–17. Springer, Heidelberg (2008) 37. Dey, A.: Modeling and Intelligibility in Ambient Environments. Journal of Ambient Intelligence and Smart Environments 1, 57–62 (2009) 38. O’Hare, G.M.P., O’Grady, M.J., Muldoon, C., Bradley, J.F.: Embedded Agents: A Paradigm for Mobile Services. Int. Journal of Web and Grid Services 2(4), 379–405 (2006) 39. O’Hare, G.M.P., O’Grady, M.J.: Addressing Mobile HCI Needs through Agents. In: Paternó, F. (ed.) Mobile HCI 2002. LNCS, vol. 2411, pp. 311–314. Springer, Heidelberg (2002) 40. Hagras, H., Callaghan, V., Colley, M., Clarke, G., Pounds-Cornish, A., Duman, H.: Creating an Ambient-Intelligence Environment Using Embedded Agents. IEEE Intelligent Systems 19(6), 12–20 (2004) 41. O’Grady, M.J., O’Hare, G.M.P., Sas, C.: Mobile Agents for Mobile Tourists: A User Evaluation of Gulliver’s Genie. Interacting with Computers 17(4), 343–366 (2005) 42. Keegan, S., O’Hare, G.M.P., O’Grady, M.J.: EasiShop: Ambient Intelligence Assists Everyday Shopping. Information Sciences 178(3), 588–611 (2008) 43. Wooldridge, M., Jennings, N.R.: Intelligent Agents: Theory and Practice. The Knowledge Engineering Review 10(2), 115–152 (1995) 44. Bernstein, P.A.: Middleware: A Model for Distributed System Services. Communications of the ACM 39(2), 86–98 (1996) 45. Riva, O., Kangasharju, J.: Challenges and Lessons in Developing Middleware on Smart Phones. Computer 41, 23–31 (2008) 46. Fok, C., Roman, G., Lu, C.: Mobile Agent Middleware for Sensor Networks: An Application Case Study.In: IPSN (2005) 47. Yaici, K., Kondoz, A.: Runtime Middleware for the Generation of Adaptive User Interfaces on Resource-constrained Devices. In: Third International Conference on Digital Information Management, pp. 587–592 (2008) 48. Repo, P., Riekki, J.: Middleware Support for Implementing Context-aware Multimodal User Interfaces. In: Proceedings of the 3rd International Conference on Mobile and Ubiquitous Multimedia( MUM 2004), pp. 221–227. ACM, New York (2004) 49. Tynan, R., O’Hare, G.M.P., O’Grady, M.J.: Agency, Ambience, Assistance: A Framework for Practical AAL. In: Augusto, J.C., Corchado, J.M., Novais, P., Analide, C. (eds.) Advances in Soft Computing, vol. 72, pp. 209–212. Springer, Heidelberg (2010) 50. Muldoon, C., O’Hare, G.M.P., Collier, R., O’Grady, M.J.: Towards Pervasive Intelligence: Reflections on the Evolution of the Agent Factory Framework. In: Bordini, R.H., Dastani, M., Dix, J., Fallah-Seghrouchni, A.E. (eds.) Multi-Agent Programming: Languages, Platforms and Applications, pp. 187–210. Springer, Heidelberg (2009) 51. Costa, R., Carneiro, D., Novais, P., Lima, L., Machado, J., Marques, A., Neves, J.: Ambient Assisted Living. In: Advances in Soft Computing, vol, pp. 86–94. 
Springer, Heidelberg (2008)
52. Bernhaupt, R., Obrist, M., Weiss, A., Beck, E., Tscheligi, M.: Trends in the living room and beyond. In: Cesar, P., Chorianopoulos, K., Jensen, J.F. (eds.) EuroITV 2007. LNCS, vol. 4471, pp. 146–155. Springer, Heidelberg (2007) 53. Smith, R.B.: SPOTWorld and the Sun SPOT. In: Proceedings of the 6th International Conference on Information Processing in Sensor Networks, pp. 565–566 (2007) 54. Walsh, M., O’Grady, M.J., Dragone, M., Tynan, R., Ruzzelli, A., Barton, J., O’Flynn, B., O’Hare, G.M.P., O’Mathuna, C.: The CLARITY Modular Ambient Health and Wellness Measurement Platform. In: Proceedings of the Fourth International Conference on Sensor Technologies and Applications (SENSORCOMM), pp. 577–583 (2010)