Game Sound Technology and Player Interaction: Concepts and Developments

Mark Grimshaw
University of Bolton, UK

Information Science Reference
Hershey • New York
Director of Editorial Content: Kristin Klinger
Director of Book Publications: Julia Mosemann
Acquisitions Editor: Lindsay Johnston
Development Editor: Joel Gamon
Publishing Assistant: Milan Vracarich Jr.
Typesetter: Natalie Pronio
Production Editor: Jamie Snavely
Cover Design: Lisa Tosheff
Published in the United States of America by
Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: [email protected]
Web site: http://www.igi-global.com

Copyright © 2011 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher. Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.

Library of Congress Cataloging-in-Publication Data

Game sound technology and player interaction : concepts and development / Mark Grimshaw, editor.
p. cm.
Summary: "This book researches both how game sound affects a player psychologically, emotionally, and physiologically, and how this relationship itself impacts the design of computer game sound and the development of technology"-- Provided by publisher.
Includes bibliographical references and index.
ISBN 978-1-61692-828-5 (hardcover) -- ISBN 978-1-61692-830-8 (ebook)
1. Computer games--Design. 2. Sound--Psychological aspects. 3. Sound--Physiological effect. 4. Human-computer interaction. I. Grimshaw, Mark, 1963-
QA76.76.C672G366 2011
794.8'1536--dc22
2010035721

British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.

All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the authors, but not necessarily of the publisher.
Editorial Advisory Board

Theo van Leeuwen, University of Technology, Australia
Gareth Schott, University of Waikato, New Zealand

List of Reviewers

Thomas Apperley, University of New England, Australia
Roger Jackson, University of Bolton, England
Martin Knakkergaard, University of Aalborg, Denmark
Don Knox, Glasgow Caledonian University, Scotland
Theo van Leeuwen, University of Technology, Sydney, Australia
David Moffat, Glasgow Caledonian University, Scotland
Patrick Quinn, Glasgow Caledonian University, Scotland
Gareth Schott, University of Waikato, New Zealand
Table of Contents
Foreword
Preface
Acknowledgment

Section 1
Interactive Practice

Chapter 1
Sound in Electronic Gambling Machines: A Review of the Literature and its Relevance to Game Sound
Karen Collins, University of Waterloo, Canada
Holly Tessler, University of East London, UK
Kevin Harrigan, University of Waterloo, Canada
Michael J. Dixon, University of Waterloo, Canada
Jonathan Fugelsang, University of Waterloo, Canada

Chapter 2
Sound for Fantasy and Freedom
Mats Liljedahl, Interactive Institute, Sonic Studio, Sweden

Chapter 3
Sound is Not a Simulation: Methodologies for Examining the Experience of Soundscapes
Linda O’Keeffe, National University of Ireland, Maynooth, Ireland

Chapter 4
Diegetic Music: New Interactive Experiences
Axel Berndt, Otto-von-Guericke University, Germany

Section 2
Frameworks & Models

Chapter 5
Time for New Terminology? Diegetic and Non-Diegetic Sounds in Computer Games Revisited
Kristine Jørgensen, University of Bergen, Norway

Chapter 6
A Combined Model for the Structuring of Computer Game Audio
Ulf Wilhelmsson, University of Skövde, Sweden
Jacob Wallén, Freelance Game Audio Designer, Sweden

Chapter 7
An Acoustic Communication Framework for Game Sound: Fidelity, Verisimilitude, Ecology
Milena Droumeva, Simon Fraser University, Canada

Chapter 8
Perceived Quality in Game Audio
Ulrich Reiter, Norwegian University of Science and Technology, Norway

Section 3
Emotion & Affect

Chapter 9
Causing Fear, Suspense, and Anxiety Using Sound Design in Computer Games
Paul Toprac, Southern Methodist University, USA
Ahmed Abdel-Meguid, Southern Methodist University, USA

Chapter 10
Listening to Fear: A Study of Sound in Horror Computer Games
Guillaume Roux-Girard, University of Montréal, Canada

Chapter 11
Uncanny Speech
Angela Tinwell, University of Bolton, UK
Mark Grimshaw, University of Bolton, UK
Andrew Williams, University of Bolton, UK

Chapter 12
Emotion, Content, and Context in Sound and Music
Stuart Cunningham, Glyndŵr University, UK
Vic Grout, Glyndŵr University, UK
Richard Picking, Glyndŵr University, UK

Chapter 13
Player-Game Interaction Through Affective Sound
Lennart E. Nacke, University of Saskatchewan, Canada
Mark Grimshaw, University of Bolton, UK

Section 4
Technology

Chapter 14
Spatial Sound for Computer Games and Virtual Reality
David Murphy, University College Cork, Ireland
Flaithrí Neff, Limerick Institute of Technology, Ireland

Chapter 15
Behaviour, Structure and Causality in Procedural Audio
Andy Farnell, Computer Scientist, UK

Chapter 16
Physical Modelling for Sound Synthesis
Eoin Mullan, Queen’s University Belfast, N. Ireland

Section 5
Current & Future Design

Chapter 17
Guidelines for Sound Design in Computer Games
Valter Alves, University of Coimbra, Portugal & Polytechnic Institute of Viseu, Portugal
Licínio Roque, University of Coimbra, Portugal

Chapter 18
New Wine in New Skins: Sketching the Future of Game Sound Design
Daniel Hug, Zurich University of the Arts, Switzerland

Appendix
Compilation of References
About the Contributors
Index
Detailed Table of Contents
Foreword
Preface
Acknowledgment

Section 1
Interactive Practice

Chapter 1
Sound in Electronic Gambling Machines: A Review of the Literature and its Relevance to Game Sound
Karen Collins, University of Waterloo, Canada
Holly Tessler, University of East London, UK
Kevin Harrigan, University of Waterloo, Canada
Michael J. Dixon, University of Waterloo, Canada
Jonathan Fugelsang, University of Waterloo, Canada

An analysis of the music and sound used in electronic gambling machines. The psychology at play is discussed: how sound is used to create a sense of winning and how such specific sound design might be useful to computer game sound design in general.

Chapter 2
Sound for Fantasy and Freedom
Mats Liljedahl, Interactive Institute, Sonic Studio, Sweden

The relationship between sound and image in computer games and how, in a reversal of the normal situation, sound can be given priority over the visual. The rationale for such a reversal is demonstrated through practical game design examples.

Chapter 3
Sound is Not a Simulation: Methodologies for Examining the Experience of Soundscapes
Linda O’Keeffe, National University of Ireland, Maynooth, Ireland

What is the relationship between player and the game’s soundscape? How elements of the soundscape are perceived by the player is explained through the principles and theories of acoustic ecology.

Chapter 4
Diegetic Music: New Interactive Experiences
Axel Berndt, Otto-von-Guericke University, Germany

An analysis of diegetic music in games and, in particular, an assessment of issues of interaction and algorithmic performance. A framework is proposed that aids in the design of both individual and social musical performance paradigms into music games.

Section 2
Frameworks & Models

Chapter 5
Time for New Terminology? Diegetic and Non-Diegetic Sounds in Computer Games Revisited
Kristine Jørgensen, University of Bergen, Norway

The terms diegetic and non-diegetic are widely used in the analysis of games (and not solely for sound). A thorough analysis of the application of the terminology to computer game sound is provided, resulting in a new model that accounts for the interactive nature of the medium.

Chapter 6
A Combined Model for the Structuring of Computer Game Audio
Ulf Wilhelmsson, University of Skövde, Sweden
Jacob Wallén, Freelance Game Audio Designer, Sweden

A framework for the analysis and design of computer game sound is provided that builds upon existing frameworks in games and film. A practical example demonstrates the model's utility.

Chapter 7
An Acoustic Communication Framework for Game Sound: Fidelity, Verisimilitude, Ecology
Milena Droumeva, Simon Fraser University, Canada

Soundscape and communication theories are used to assess computer game soundscapes and the ways in which the player perceives them. Different codes of realism are discussed and a model of the player and soundscape combined as acoustic ecology is proposed.

Chapter 8
Perceived Quality in Game Audio
Ulrich Reiter, Norwegian University of Science and Technology, Norway

Perceptual bi-modality and cross-modality between auditory and visual stimuli are discussed in addition to issues of realism and verisimilitude. A design model is put forward that assesses audio quality in computer games on the basis of player interactivity and attention.

Section 3
Emotion & Affect

Chapter 9
Causing Fear, Suspense, and Anxiety Using Sound Design in Computer Games
Paul Toprac, Southern Methodist University, USA
Ahmed Abdel-Meguid, Southern Methodist University, USA

An overview of relevant emotion theories and their potential application to sound design for computer games. In particular, discussion centres around the eliciting of fear and anxiety during gameplay, and the results of experiments in this area are discussed.

Chapter 10
Listening to Fear: A Study of Sound in Horror Computer Games
Guillaume Roux-Girard, University of Montréal, Canada

A thorough analysis of sound design and sound perception in the survival horror game genre that focuses upon sound’s ability to instil fear and dread in the player. An analytical model of sound design is proposed that is founded upon the reception of sound, rather than production, and the use of the model is illustrated through several practical examples.

Chapter 11
Uncanny Speech
Angela Tinwell, University of Bolton, UK
Mark Grimshaw, University of Bolton, UK
Andrew Williams, University of Bolton, UK

An exploration of the genesis of the Uncanny Valley theory and its implications for the design and perception of Non-Player Character speech in horror computer games. Empirical work by the authors on the perception of such speech is discussed, particularly with regard to the evocation of fear and anxiety.

Chapter 12
Emotion, Content, and Context in Sound and Music
Stuart Cunningham, Glyndŵr University, UK
Vic Grout, Glyndŵr University, UK
Richard Picking, Glyndŵr University, UK

A summary of emotion research and its relevance to the design of sound and music for computer games is provided before a discussion on the use and effect of musical playlists during gameplay. In particular, such playlists can be generated automatically according to the real-world environment the player plays in and according to the player’s changing psychology and physiology.

Chapter 13
Player-Game Interaction Through Affective Sound
Lennart E. Nacke, University of Saskatchewan, Canada
Mark Grimshaw, University of Bolton, UK

An assessment of the role and efficacy of psychological, physiological, and psychophysiological measurements of players exposed to sound and music during gameplay. Recent empirical results from a psychophysiological study on computer game sound are presented, followed by a discussion on the implications of biofeedback for game sound design and player immersion.

Section 4
Technology

Chapter 14
Spatial Sound for Computer Games and Virtual Reality
David Murphy, University College Cork, Ireland
Flaithrí Neff, Limerick Institute of Technology, Ireland

An introduction to spatial sound, its application to computer games and the technological challenges inherent in emulating real-world spatial acoustics in virtual worlds. A variety of current technologies are assessed as to their strengths and weaknesses and suggestions are made as to the requirements of future technology.

Chapter 15
Behaviour, Structure and Causality in Procedural Audio
Andy Farnell, Computer Scientist, UK

A critical assessment of the current use of audio samples for computer games from the point of view of creativity and realism in game sound design. Procedural audio is proposed instead and the strengths and opportunities afforded by such a technology are discussed.

Chapter 16
Physical Modelling for Sound Synthesis
Eoin Mullan, Queen’s University Belfast, N. Ireland

A review of the potential for computer game sound design of one branch of procedural audio, viz. physical modelling synthesis. The historical evolution of the process is traced, leading to a discussion of how such synthesis might be integrated into game engines and the implications for player interaction.

Section 5
Current & Future Design

Chapter 17
Guidelines for Sound Design in Computer Games
Valter Alves, University of Coimbra, Portugal & Polytechnic Institute of Viseu, Portugal
Licínio Roque, University of Coimbra, Portugal

A discussion of the relevance and importance of sound to the design of computer games with particular regard to the concepts of resonance and entrainment. Seven guidelines for game sound design are presented and exemplified through an illustrative example of a game design brief.

Chapter 18
New Wine in New Skins: Sketching the Future of Game Sound Design
Daniel Hug, Zurich University of the Arts, Switzerland

The aesthetic debt that computer game sound owes to film sound is described as a prelude to a variety of examples from independent game developers going beyond such a paradigm in their sound design. Suggestions are made as to how game sound design might evolve in the future to take greater account of the interactive potential inherent in the structure of computer games.

Appendix
Compilation of References
About the Contributors
Index
Foreword
BANG! There, that got your attention. OK, so that’s a fairly bad joke to illustrate just what sound can do for you… namely, GET YOUR ATTENTION! Actually, sound does so much more: it connects your visual input to a frame of reference, the audio-visual contract. So, when we create experiences, either in film, TV, live on stage, or in computer games, we use this cerebral connection between sound and vision to intensify your overall experience. Because that’s our goal in any of these media: to create an experience! Sound takes up 50% of this experience (maybe not 50% of the budget, but that’s another story).

There’s an old adage we audiophiles use when discussing budgets in the hope that a producer might actually listen to us once in a while. If you get a room full of people to watch great graphics with poor sound and then compare it to poor graphics with great sound, they will almost always perceive the latter as having the better quality graphics. Generally producers don’t believe this story, but I have witnessed it in real life.

A few years ago I was working on an AAA title–action adventure: cars, guns, gangsters… you get the idea. One evening, the sound designer reworked the “Whacking someone over the head with a pool cue” sound, improving its overall effectiveness with small, subtle, deep thuds, some crunching bone (actually carrots), and a deliciously realistic skin smacking sound (supermarket chicken being hit by a baseball bat). He added his new sound to the game database and went home. The following morning the game team rebuilt the whole game (including the new sound). Later that day many people congratulated the “Whacking someone over the head with a pool cue” animator on his new improved animation: he was somewhat bemused to say the least. He hadn’t worked on that animation for several weeks. I’m sure you can work out what happened: people saw the same animation with the new improved sound and believed they were seeing a better animation.
This is how we use the audio-visual contract to our benefit.

OK, so that’s my practitioner's story in, but let’s take a look at game sound and what you need to study if you are interested in this field… and what’s in this book. There are several axes or dimensions to think about. Emotion is the obvious one: fear, anger, hatred and so on are all well represented in game sound, from survival horrors to gangster simulations. But what about humour, joy, happiness? Just play Mario Kart, Sonic the Hedgehog, Loco Roco and I guarantee you’ll soon realise that the sound has a great deal to do with provoking laughter, smiles, and an enlightened mood.

So, I’ve now mentioned the breadth of experience our industry creates, but think too of another axis, the history of game sound: from tiny little beeps and bleeps (Pong) to the colossal soundscapes of today’s blockbuster games. A story which starts with a few programmers/musicians/sound engineers trying to get “something” out of a paltry 8-bit chip after the graphics guys have already had their fill, through to my point at the beginning of this introduction–persuading a producer to give you some kind of serious sound budget. A tale that runs from one guy who does everything (including the voice over) to a small army of specialists: musicians, Foley artists, sound technicians, weapons specialists, vehicle specialists, atmosphere creators; the list goes on. Our game sound pioneers took this journey and, along the way, solved some tricky issues, like repetition–in music, in dialogue, in sound effects–memory management, automated in-game mixing and so on.

I am going to sum this section up by saying there are now many different aspects to game sound: music, diegetic sound, atmospheres, interactive music, development of emotional connection, realism, abstractism, super-realism. What I really like about this book’s approach to game sound is the five core sections, which give it a unique and very practical way of tying together all the axes I mentioned earlier, namely: Interactive Practice, Frameworks & Models, Emotion & Affect, Technology, and Current & Future Design.

In conclusion, then, I hope that you, as a reader, enjoy the discussion and findings presented here as much as I have.

Dave Ranyard

Dave Ranyard is the Game Director/Executive Producer of Sony’s hugely successful, 20+ million selling SingStar franchise. He has been in the games industry since the mid-nineties, starting out as an AI programmer at Psygnosis, and later moving to Sony Computer Entertainment Europe's London Studio where he has held a number of roles over the past 10 years, ranging from audio manager to running the internal creative services group. He has worked on titles including Wip3out, The Getaway & The Getaway: Black Monday, The Eyetoy: Play series and, more recently, SingStar. Prior to the games industry he lectured in Artificial Intelligence at the University of Leeds, where he also gained a PhD in the subject. In recent years, Dave has taken a keen interest in GDC and is currently on the advisory board. Dave is a keen musician and he has written and produced many records over the past 15 years.
Preface
A phrase often used when writing about the human ability to become immersed in fantasy is “the willing suspension of disbelief”, which Samuel Taylor Coleridge first coined in the early 19th Century as an argument for the fantastical in prosody and poetry. What is a computer game? At base, it is nothing more than a cheap plastic disc encased within a cheaper plastic tray. And the system it is destined for? A box of electronics, lifeless in a corner. Put the two together, though, throw in the player's imagination and interaction and he or she is delivered of experiences that, to use Diderot's phrase, are “the strongest magic of art”. Disbelief is suspended willingly, sense and rationality recede, and the player becomes engaged with, engrossed in, and, given the appropriate game, immersed in a virtual world of flickering light and alluring sound where the fantastical becomes the norm and the mythic reality.

For the reader interested in that flickering light, there is a plethora of books and scholarly articles on the subject. For the reader interested in the ins and outs of music and sound software, how to rig a microphone to record sound, and how to transfer that sound to a game environment, there likewise is a wealth of handy resources. For the reader truly interested in understanding or harnessing the power of sound in that virtual world, in emulating reality or the creation of other realities, in engaging, engrossing and immersing the player through sound and emotion, there is this book.

This is a book that deals with computer game sound in a variety of forms and from a variety of viewpoints. Sound FX, rather than game music, is the topic, other than where the music is interactive or otherwise intimately bound up with the playing of the game.
Some such sound FX are used to emulate acoustic environments of the real world while others deliberately set out to create alternate realities; some are based upon the use of audio samples whilst others are starting to make use of procedural synthesis and audio processing; some sound works hand-in-hand with image and game action to immerse the game player in the gameworld while interactive music in other cases is the sole raison d’être of the game. From the simplest of puzzle games to the most detailed and convoluted of gameworlds, sound is the indicator par excellence of player engagement and interaction with the structures of the game and the rules of play.

Academic writing about game sound, its analytical and theoretical drivers, is a developing area and this is reflected by the diversity of theoretical methodologies and the variety of terminology in use. Far from being a weakness, this range points to the potential for the discipline and the wide appeal of its study because it is, at heart, multidisciplinary. The range of subject matter across the chapters reflects the complexity and potential of human interaction with sound in virtual worlds as much as it reflects the passions, backgrounds, and training of the book's contributors. Their contributions to the study of computer game sound bring in disciplines and theories from film studies, cultural studies, sound design, acoustic ecology, acoustics, systems design, computer programming, cognitive sciences, and psychology. The authors themselves have a diversity of experience: some are researchers and academics whilst others are sound practitioners in the games industry. All are experts in their chosen field yet all are students of game sound, forever exploring, forever questioning, forever seeking to drive the study and practice forward.

The readership of this book is intended to be similarly diverse in terms of both discipline and motivation. There is something for everyone here: the student for whom a knowledge of computer game sound leads to that important qualification, a game sound designer wishing to keep abreast of the latest thinking and developmental concepts, or an academic theoretician or researcher working to innovate game sound theory or technology. Furthermore, the appeal of the book is wider than computer games, reaching out to those working in virtual reality or with autism, for example. The reader will not find screeds of instructions for software or hardware, programming recipes or tips on how to break into the industry. Instead, contained within this book will be found lucid essays on philosophical questions, theoretical analyses on aspects of computer game sound, models for conceptualizing sound, ideas for sound design, and provocative discussions about new sound technology and its future implications. All chapters raise further questions as to the fascinating relationship between player and sound.

Reflecting the disciplines the authors come from, some key terms (found at the back of each chapter) are provided with definitions that, prima facie, differ slightly from the definition provided for the same key term in another chapter. As with the authors' preferences for American or British English, this has been allowed to stand in order to illustrate both the diversity of approach to the topic throughout the book and the educational and professional backgrounds of each author.
The study of computer game sound is yet young and the terminology and its application still in flux: the definition for each key term, where minor differences exist, pertains to the chapter the key term belongs to. The term “computer game” has been chosen, in preference to a number of other possibilities, as referring to all forms of digital game, arcade machine, gaming console, PC game, or videogame, and the reader may assume that, unless a chapter uses one of those specific forms, “computer game” references the general case. Quite deliberately, the term has been chosen in preference to videogame in order to fly the flag for sound: videogames are not just video but sound too, and all chapters proselytize the importance of sound to the game experience even where they reference the relationship of sound to image. “Sound” has generally been chosen in preference to “audio” because the focus of the book is on the relationship between sound and player rather than techniques for creating and manipulating audio data. However, “audio” is the usual term in some disciplines and, here, authors have been given free rein to use whichever terminology they are comfortable with.

The book itself is organized into five sections. None is mutually exclusive in terms of its content. Indeed, the astute reader will pick up divers common threads meandering their way through the chapters: the debt game sound owes to film sound and the need to slough off that used skin, issues of presence and player immersion, realism, the unique, interactive nature of computer game sound, and the potential for the emotional manipulation of the player, for instance. All chapters, too, have an eye on the future and its possibilities, and authors have been encouraged to speculate on that future.

An oft-overlooked area in computer gaming (and certainly not the first thing that comes to mind with the term “computer game”) is that of electronic gambling machines: one-armed bandits and their modern equivalents.
Karen Collins and her co-authors open the first section on Interactive Practice by providing a fascinating glimpse into the sound of such machines and how music and sound FX provoke and toy with the user's emotions in an effort to part them from their money. They draw parallels to sound use in other, more typical computer games and suggest ways in which sound use in electronic gambling machines might provide inspiration for the design and analysis of sound in computer games in general.

Mats Liljedahl's chapter is an attempt to redress the imbalance between visual and auditory modes in computer games. It does this by providing an overview of the use of sound in virtual environments and augmented realities, concentrating in particular on the sound designer's required attention to emotion and flow. Using the concept of GameFlow, Liljedahl describes and explains two games that he has helped to design and in which the sound modality is purposefully given priority over the visual. The chapter seeks to inspire and serves as an introduction to the art of computer game sound design: Sound for Fantasy and Freedom.

Linda O’Keeffe takes a holistic view of computer game sound by treating it as a dynamic soundscape created anew at each playing. She draws upon soundscape and acoustic ecology theory to elucidate her stance and compares and contrasts game soundscapes to real-world soundscapes. Throughout, O’Keeffe prompts questions as to the listener's perception of, and relationship to, soundscapes: what is noise, what roles do context and the player's culture and society play? Ultimately, how can (and why should) we design immersive soundscapes for the gameworld?

Next, Axel Berndt takes a close look at the occurrence of diegetic music in games, design principles for music games and issues of interactivity and algorithmic performance. A critique is presented of recent and current games as regards the performance of in-game music, and advice and solutions are offered to improve what is currently a rather static state of affairs, merely scratching at the surface of possibility. Interactivity in music games is assessed through a critique of what is termed visualized music and Berndt proposes a framework of design that incorporates musical performance paradigms both as individual and as social, collaborative practice.
This leads to Frameworks & Models, which is opened by Kristine Jørgensen, whose chapter is both an exhaustive survey of the use of diegetic terminology with regard to game sound and a proposal for a new conceptual model for such sound. The main thrust of her argument is that the concepts of diegetic sound and non-diegetic sound have been transposed from film theory to the study of computer games with frequently scant regard for the very different premises of the two media. The interactive, real-time nature of computer games and the immersive environments of many game genres require a radical reappraisal of sound usage and sound design for games: games are not films and the use-value of game sound is greater than that of film sound. This is followed by a chapter in which Ulf Wilhelmsson and Jacob Wallén propose a model for the analysis and design of computer game sound that combines two previous models–the IEZA Framework for game sound and Walter Murch's conceptual model for film sound–with affordance and cognition theories. The IEZA Framework accounts for the structural basis of game sound, the function of sound, while Murch's model describes sound as either embodied or encoded, a system that accounts for human perception and cognitive load limits. Combining the two systems, the authors assert, provides a powerful tool not only for analysis but also for the planning and design of computer game sound, and this claim is demonstrated by a practical example. Milena Droumeva, in her chapter, filters the computer game soundscape through the precepts of Schafer's and Truax's soundscape and acoustic communication theories. Different ways of listening to game sound are proposed with assessments of the role of sound in the perception of realism: Does sound provide fidelity to source or does it provide a sense of verisimilitude and what are the strengths of each approach as regards computer game sound design? 
Droumeva ultimately advocates a view comprising game soundscape and player together as an acoustic ecology and expands that ecology from the virtual world of the game to include concurrent sounds from the real world.
Ulrich Reiter's chapter on Perceived Quality in Game Audio explores the bi-modality and cross-modality of auditory and visual stimuli in gameworlds. It summarizes previous work in this area and proposes a high-level salience model for the design of audio in games that accounts for both interactivity and attention as the bases for the evaluation of audio quality. Issues of level of realism and verisimilitude are discussed while the validity, and use, of Reiter's proposed model is substantiated through experimental methods outlined towards the end of the chapter. Paul Toprac and Ahmed Abdel-Meguid's chapter introduces the section dealing with Emotion & Affect. The authors present to the reader four relevant emotion theories then summarize fundamental research they conducted in order to test those properties of diegetic sound best suited to evoke sensations of fear, anxiety, and suspense. Their results, an early example of an empirical and statistical basis for sonic fear and anxiety, lead the authors to devise some rough heuristics for the design of such emotions into computer game sound and to point to directions for future research in the area. The following chapter, by Guillaume Roux-Girard, also deals with the perception of fear in computer game sound being an in-depth analysis of sound usage in the survival horror game genre that focuses on sound's ability to instil fear and dread in the player. It proposes a model for sound, based upon film sound practice and existing models for computer game sound, that is user-centric–one based on the reception of sound rather than its production–and Roux-Girard provides several illustrative examples from recent horror games to validate the model. In Uncanny Speech, Angela Tinwell, Mark Grimshaw, and Andrew Williams continue the horror theme with a look at Non-Player Character speech in horror games and its relationship to the 1970s' theory of the Uncanny Valley. 
The authors trace the development of theories of the uncanny from its beginnings in psychoanalysis over 100 years ago through to its practical application in robotics (as the Uncanny Valley theory) and its strong correlation to fear and anxiety in computer games. Recent empirical work by the authors is described and its implication for the design and production of Non-Player Character speech in computer games is discussed. Stuart Cunningham, Vic Grout, and Richard Picking's chapter looks at Emotion, Content, and Context in Sound and Music. The chapter is an exploration of the interaction that is possible between player and computer game sound, in particular, music playlists used in conjunction with games. The authors provide an overview of emotion research in the context of computer games and consider the emotional and affective value of sound and music to the player. The experimental work that is summarized in the chapter includes the generation of musical playlists according to the environmental context of the player: that is, the environment outside the game. Not only does this raise the intriguing situation of sensory and perceptual overlap and interplay between real-world and virtual, but the authors also suggest further possibilities such as the playlists themselves being responsive to the changing psychology and physiology of the player during gameplay. Concluding the section Emotion & Affect, Lennart Nacke and Mark Grimshaw's chapter is a study of computer games as affective activity. In this form of activity, sound has a large role to play and the chapter focuses on that role as it affects, indeed effects flow and, particularly, immersion: in the latter case, a preliminary mathematical equation is supplied for modelling immersion. The authors start with a review of psychological and physiological experiments and, combining these approaches, psychophysiological experiments on the effects of sound and image in virtual environments. 
Following a summary of a recent psychophysiological study on computer game sound conducted by the authors, the chapter concludes with a discussion on the advantages and disadvantages of such an empirical methodology before speculating on the implications of biofeedback for computer game sound with reference to player
interaction and immersion. The section on Technology opens with a chapter on Spatial Sound for Computer Games and Virtual Reality by David Murphy and Flaithrí Neff. The authors guide the reader through the basics of human spatial sound processing and the propagation of sound in space while pointing out the problems faced in transferring and accurately replicating these phenomena within computer game systems. They conclude with a survey of existing spatial sound technologies for use in virtual worlds, their strengths and weaknesses, and look to the future possibilities for computer game sound posed by the ongoing development of the technology. Andy Farnell's chapter comprises an in-depth critique of sample-based game audio followed by an analysis of the potential of procedural audio: the real-time design of sound. For Farnell, audio samples have proven to be too limiting, both for the purposes of creativity in game sound design and for the promise of realism: audio samples are predicated upon selection whereas procedural audio is design. A close discussion of procedural audio techniques, both as they have been used and how they might be used to the benefit of computer games, leads the author to the conclusion that it is both pointless and wasteful of computer resources to pursue precise sonic realism: procedural audio can instead be used to provide just the necessary level of realism, a perceptual realism, that is required for the player to comprehend source and source behaviour whilst saving scarce resources for more interesting and immersive tasks. Eoin Mullan's chapter delves further into the promise of procedural audio by providing a detailed exploration of the potential of physical modelling, a branch of procedural audio. He traces the technique's development from the synthesis of musical instrument sounds to its current state where it stands poised to deliver a new level of behavioural realism to computer games. 
For Mullan, this will be achieved through the integration of the technology with game physics engines and through physical modelling's ability to provide unprecedented levels of player-sound interaction. Current & Future Design is the subject of the next two chapters comprising the final section. First, Valter Alves and Licínio Roque present a lucid case for the importance of sound to the design and experience of computer games: attention should be paid to sound from the start of the design process the authors assert. Alves and Roque discuss concepts such as resonance and entrainment as means to engage and immerse players in the gameworld. They present 7 guidelines for game sound design and detail an illustrative example of the application of these heuristics. Lastly, in this section, Daniel Hug's chapter is a clarion call for a new aesthetic of computer game sound. Through a discussion of two dominant paradigms in computer game sound discourse, pursuit of reality and cinematic aesthetics, it details the debt that game sound owes to cinema sound but then uses examples, in particular from many innovative game developers and from cinema's own subversive stream, to shrug off that mantle and argue for a new future for game sound design. Rich in ideas and provocative in its discourse, the chapter is full of practical suggestions for making computer game sound not only a different experience to cinematic sound but an engaging and rewarding one too. Closing the chapters is an appendix which is a lightly edited transcript of an online discussion forum to which the book’s contributors were invited to attempt to debate and answer the question: What will the player experience of computer game sound be in the future? This is, of course, an open-ended question and the unstructured, lively debate that ensues is indicative of the open-ended potential for the future of computer game sound. Whatever your need in picking up the book, I hope you will find it met. 
At the very least, perhaps one of the contributions here will raise intriguing questions in your mind, an itch that will be scratched
by future investigation on your part. Perhaps the ideas contained within will inspire you to develop a new game sound design paradigm or to innovate the technology and push the frontiers of human-sound interaction? After all, the aim of this book is not just to contribute to the development of ever better computer games or more cogent analyses but it is also to cast an illuminating light on at least one part of humankind's relationship with sound as we step out of reality into virtuality. Mark Grimshaw University of Bolton, UK
Acknowledgment
An anthology such as this requires the hard work and input of many people not just the contributing authors: publishers and their staff, the Foreword author, the Editorial Advisory Board, and the reviewers. Each chapter has been exhaustively blind reviewed by two leading academics in the field and I would like to thank each and every one of them for their tireless work in support of this project: Thomas Apperley (University of New England, Australia), Roger Jackson (University of Bolton, England), Martin Knakkergaard (University of Aalborg, Denmark), Don Knox (Glasgow Caledonian University, Scotland), Theo van Leeuwen (University of Technology Sydney, Australia and member of the Editorial Advisory Board), David Moffat (Glasgow Caledonian University, Scotland), Patrick Quinn (Glasgow Caledonian University, Scotland), and Gareth Schott (University of Waikato, New Zealand and member of the Editorial Advisory Board). My thanks too to Dave Ranyard of Sony Computer Entertainment Europe Ltd. for his flattering foreword and my appreciation for the guidance and patience shown towards me by Joel Gamon of IGI Global. Finally, I must extend my apologies to my contributors and fellow authors for the hectoring they were subjected to by the editor: The end surely justifies the means. Mark Grimshaw University of Bolton, UK.
Section 1
Interactive Practice
Chapter 1
Sound in Electronic Gambling Machines: A Review of the Literature and its Relevance to Game Sound
Karen Collins, University of Waterloo, Canada
Holly Tessler, University of East London, UK
Kevin Harrigan, University of Waterloo, Canada
Michael J. Dixon, University of Waterloo, Canada
Jonathan Fugelsang, University of Waterloo, Canada
ABSTRACT
A much neglected area of research into game sound (and computer games in general) is the use of sound in the games on electronic gambling machines (EGMs). EGMs have many similarities with commercial computer games, particularly arcade games. Drawing on research in film, television, computer games, advertising, and gambling, this chapter introduces EGM sound and provides an introduction into the literature on gambling sound in general, including discussions of the casino environment, the slot machine EGM, and the physiological responses to sound in EGMs. Throughout the article, we address how the study of EGM sound may be relevant to the practice and theory of computer game audio.
DOI: 10.4018/978-1-61692-828-5.ch001
Copyright © 2011, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
INTRODUCTION
A much neglected area of research into computer game sound is the use of sound in electronic gambling machines (EGMs; also known as slot machines, video slots and video fruit machines). To put the influence of EGMs into perspective, the computer game industry in the United States contributes approximately $8 billion in sales each year to the country’s GDP (Seeking Alpha, 2008). The slot industry, on the other hand, generates approximately $1 billion a day in wagers in the United States alone (Rivlin, 2004). Moreover, this amount is increasing as slot machines grow in popularity and are increasingly found outside of designated casinos. In 1980, an average of 45% of the gaming floor of a Nevada casino was devoted to slots, whereas today this number is at least 77%, with machines generating more than twice the combined revenue of all other types of games (Schull, 2005). Although they are also increasing in complexity (see below), slot machines are attractive to players because they require little or no training or previous experience, they are quick and easy to play and, perhaps most importantly, they produce a number of sights and sounds that make them striking and exciting on the casino floor. EGMs have many similarities with commercial computer games, particularly arcade games. In fact, many of the early video arcade game companies also had a long prior history manufacturing slot machines, including Bally and the Williams Manufacturing Company. As such, many of the creators and designers of slot machines today have also worked for computer game companies. In fact, much of the sound design and music of slots is still outsourced to game sound designers and composers, such as George “The Fat Man” Sanger (composer of 7th Guest, Wing Commander, and others). Furthermore, until the 1990s slot machines had fairly standard mechanical or electro-mechanical reels and parts. Today, however, with the digitization of slot machines there are now considerably
more structural components to slot machine gameplay. Many of these structural components have been adapted from computer games, such as cut scenes, bonus rounds and specialist plays. And while the arm of the “one-armed bandit” remains on many slot machines, more commonly players use simple rectangular or round blinking buttons very similar to those of many arcade games. There are also, of course, some notable differences between computer games and electronic gambling machines. Historically, the vast majority of EGMs have been exclusively installed in casinos, where the usual age for entry is 21, thus effectively excluding young people from gameplay. However, this is changing as the companies attempt to capture a younger audience and the machines proliferate in non-gambling environments (Rivlin, 2004). Today, EGMs can be found in bars, restaurants, arcades, hotel lobbies, and entertainment and sporting venues. There are also, of course, virtual slot machines online, and these represent a significantly growing proportion of slot income. Research has further shown that casinos and gaming companies are seeking to target women, particularly those over 55 as its main demographic, although as the venues change, the target market is becoming younger. Electronic gambling machines today are also much faster to play than their mechanical and electronic ancestors. Now, the average player initiates a new game every 6 seconds (Harrigan & Dixon, 2009a, p. 83), playing up to 600 games per hour, and there are even artificially intelligent machines that adapt to the speed of the player— when they start slowing down, the machine will slow down with them, but work to build them back up after a little break. 
Many games aim for “immersion” (what might be best described in terms of Csíkszentmihályi’s concept of “flow”, characterized by concentration on the task at hand, a sense of control, merging of awareness and action, temporal distortion and a loss of self-consciousness; see Csíkszentmihályi, 1990). It is, however, often possible to jam the button
with a piece of card, and let the machine play on its own for even faster results. Most machines also include a “Bet Max” function, a one-button mechanism that simultaneously allows players to wager the maximum allowable amount and to spin the reels—a function that encourages both faster wagering and continuous, rapid gameplay requiring a minimum of attention from distracted players.1 Thus, a “nickel slot” can mean wagers of up to about $4 per bet, although these are typically displayed in “credits” of 25-cent allotments so the illusion is that the player is betting less. The biggest distinction between slot machines and computer games is, of course, the aspect of financial risk added to gameplay, which adds a potential new level of psychological, cognitive, and emotional involvement in the game (we say potential because these distinctions are as yet unexplored in the research). The win-loss component of electronic gambling games is more complicated than it at first appears, with “losses disguised as wins”, and “near-misses” (see below). These are carefully doled out according to a reward schedule, based on scientific research about how long we will play before needing a win to keep motivated (see Brown, 1986). Reward schedules have also been built into computer games, particularly hunter-gatherer type games in which the player must spend considerable time roaming lands and collecting objects.2 Some psychologists suggest that the reward schedule combined with the rapidity of the gameplay is similar in character to the effect of amphetamines, stimulating the on-off cycle that repeatedly energizes and de-energizes the brain. This link is supported by functional magnetic resonance imaging studies revealing that brain scans of active gamblers and active cocaine users reveal similar patterns of neurocircuitry (Crockford, Goodyear, Edwards, Quickfall, & el-Guebaly, 2005). 
It has been suggested that there are many different motivations for gambling, with a distinct dichotomy between arousal/action seekers and those who seek escape/dissociation. In other words, slot machine games are designed
to simultaneously satisfy different needs of different players. In this chapter, we will introduce the literature of EGMs and related phenomena to the reader with a specific focus on the use of sound. A brief introduction to the structural components of gameplay is followed by an examination of existing studies on the sonic elements of casinos and gambling and an exploration of how this knowledge might apply to computer games.
STRUCTURAL COMPONENTS OF EGM GAMES
A slot machine essentially involves three or more reels (in today’s EGMs, these are often computer-generated digital simulations, rather than actual mechanical parts). Touch-screen machines typically do not have handles, but rather the reels are spun by the player pressing a button (the one-armed bandit style pull-lever handle still exists on most slot machines, but is not often used). When the reels stop spinning, three or more icons (often up to five) will line up on the payline for a win, but other combinations of icons can also lead to a win (diagonal lines, and so on), with the amount won relating inversely to the probability of the symbol coming up on the payline (Turner & Horbay, 2004). Payouts vary by country/state/province and by initial betting amount, ranging from about 80 to 95%; in other words, a fairly significant number of plays result in some form of a “win” (see below for information about these “wins”). The amount bet on a play can also vary: the player can, for instance, be playing a “nickel slot” but end up betting several dollars on a single play by betting on a larger number of potential payout lines. Moreover, with EGMs, the number of payout lines also varies. For example, Lucky Larry’s Lobstermania, made by IGT, has five reels and 15 possible paylines. The maximum wager is 75 credits ($3.75), while the top prize is 50,000 credits ($2,500). There are also two different bonus
rounds available depending on the version of the game: a Great Lobster Escape, and a Buoy Bonus round in which additional payouts are guaranteed but the amount of payout varies.3 In these bonus rounds, the player is asked to select from a variety of options, giving the player the illusion of control and the perception of skill. The use of a stopping device, for instance, in which the players can stop the spinning of the reels voluntarily, increases the perception that the stopping is not random but that there is some form of skill involved: By having that control, there is an increased probability of success, thus making the game more attractive to the player (Ladouceur & Sévigny, 2005). Indeed, slot machines today can feature a library of game variations, in order to increase what the industry calls “time on device” (Schull, 2005, p. 67). Some features of EGMs (and particularly bonus rounds) such as nudge and stop buttons, give the illusion of control to the player—an important component but one that the gaming industry has referred to as being an “idiot skill” (Parke & Griffiths, 2006, p. 154). This perhaps calls to mind the “button-mashing” skill of the early arcade game beat’em-up genre.4 David Surman (2007) notes that Capcom’s 1987 arcade hit Street Fighter, for instance, was released with a touch-sensitive hydraulic button system in which the increase of the player’s pressure on the button related to the power of the player’s character’s kicking and punching, thus encouraging players to bang and smash on the buttons. He states: “This ‘innovation’ led to many machines being rendered defunct by over-zealous players smashing the control system. The cacophony of these large red buttons being bashed would come to signify the arcades which stocked a number of these first Street Fighter units” (pp. 208-209). When the player has an increased perception of control, they are more likely to engage with the game, play for longer, and spend more money. 
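The wager and prize figures quoted above for Lucky Larry’s Lobstermania can be checked with a few lines of arithmetic. The following is a minimal sketch, not a description of how any real machine is programmed. It assumes a five-cent credit (consistent with 75 credits being $3.75), borrows the 6-second pace of play and the 80 to 95% payout range from the text, and treats the payout figure as a long-run payback percentage, which is an assumption made purely for illustration. The function names are hypothetical.

```python
# Illustrative arithmetic only; all names here are hypothetical.
CENTS_PER_CREDIT = 5  # assumption: "nickel" credits, so 75 credits = $3.75


def credits_to_dollars(credits: int) -> float:
    """Convert a credit amount to dollars using the assumed credit value."""
    return credits * CENTS_PER_CREDIT / 100


def spins_per_hour(seconds_per_spin: int) -> int:
    """Number of games started in an hour at a steady pace."""
    return 3600 // seconds_per_spin


def expected_hourly_loss(credits_per_spin: int,
                         seconds_per_spin: int,
                         payback_percent: int) -> float:
    """Long-run expected loss in dollars per hour.

    Assumes every spin wagers the same amount and that the machine
    returns `payback_percent` of all money wagered over the long run.
    """
    wagered_cents = (spins_per_hour(seconds_per_spin)
                     * credits_per_spin * CENTS_PER_CREDIT)
    lost_cents = wagered_cents * (100 - payback_percent) / 100
    return lost_cents / 100


if __name__ == "__main__":
    print(credits_to_dollars(75))           # maximum wager in dollars
    print(credits_to_dollars(50_000))       # top prize in dollars
    print(spins_per_hour(6))                # games per hour at 6 s per game
    print(expected_hourly_loss(75, 6, 90))  # max bet, 90% assumed payback
```

Under these assumptions, a player betting the maximum every 6 seconds would wager $2,250 an hour and, at a 90% payback, could expect to lose $225 of it over the long run. Real sessions vary widely around any such average.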
Bonus or built-in “secret” functions (often a cancel button, slow-down or hints—these are typically not actually secret but often not immediately
apparent) also increase the illusion of control. The bonus elements of gameplay are sometimes hinted at by the sound (as in The Simpsons EGM, in which Krusty the Clown says “Here’s a clue for ya, Jack”). A simple bonus or increased skill component leads to an increased psychological involvement on the part of the player and, it is suggested, has a “significant effect on habitual gambling” (Parke & Griffiths, 2006, p. 176). The use of these functions helps to keep players interested in that they hope that they will learn the “secrets” of the machine and thus be able to demonstrate their skill through winning as well as increase their winnings. Of course, similar bonus rounds and “Easter Eggs” are often built into computer games to reward the regular player who has taken the time to find them—thus upping the player’s credibility amongst other gamers. Usually superfluous to gameplay, Easter Eggs are nevertheless viewed as rewards for the time spent on the device (see Oguro, 2009). But even beyond the world of Easter Eggs, players develop skills beyond the initial simple skills required to technically play a game, notes Surman (2007): While a player new to videogames explores the pleasures of the gameworld with the clumsy curiosity of a toddler, as one becomes a more sophisticated gamer other pleasure registers come into play, which are concerned with a literacy of sorts in which one is sensitive to the codes and conventions of the gameworld, and the panoramic experience of worldliness reduces to a hunt for the telltale graphical or acoustic ‘feedback loops’, confirming success in play. Still higher, as the core gameplay becomes exhausted, players end up centring on the reflexive undoing of the gameworld; pushing it to its limits, exploring and exploiting glitches, ticks, aberrations in the system. (p. 
205) This description fits closely with Csíkszentmihályi’s (1990) ideas of the requirements for flow (immersion), where a careful balance between difficulty and skill is required to continually engage a player in an activity. As the skill increases, so must the difficulty, or the player will become bored. If the skill required is too great for the novice, the player will likewise lose interest. Equally important to the psychology of the player are the built-in gambling machine concepts of the “near miss” and the “loss disguised as a win”. A near miss is a failure that was close to a win, such as two matching icons arriving on the payline followed by a third reel whose icon sits just off the payline. Slot machine manufacturers use this concept to create a statistically implausible number of near misses (Harrigan, 2009), which convinces the player that they are close to winning, and therefore leads to significantly longer playing times (Parke & Griffiths, 2006). As gambling researchers Jonathan Parke and Mark Griffiths (2006) describe it: At a behaviourist level, a near miss may have the same kind of conditioning effect on behaviour as a success. At a cognitive level, a near miss could produce some of the excitement of a win, that is, cognitive conditioning through secondary reinforcement. Therefore, the player is not constantly losing but constantly nearly winning. (p. 163) A loss disguised as a win, on the other hand, is a play in which the player “wins” but receives a payout smaller than the amount wagered, hence actually losing on the wager despite being convinced (sonically) that they have, in fact, won. So for example, a gambler might wager $2 on a play and win $1.50 back. S/he is actually losing 50 cents, but is given the reinforcement cues (see below) of a win. An important factor that contributes to all of these illusions, and so increases playing time and money lost, is sound. 
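The distinction between a win and a loss disguised as a win comes down to comparing the payout with the wager. As a rough sketch (the function names and the “break-even” label are illustrative assumptions, not industry terminology), the $2 example above can be worked through as follows. Amounts are held in whole cents to avoid rounding surprises.

```python
# Hypothetical helper names; amounts are in whole cents.

def net_result_cents(wager_cents: int, payout_cents: int) -> int:
    """Money gained (positive) or lost (negative) on a single play."""
    return payout_cents - wager_cents


def classify_play(wager_cents: int, payout_cents: int) -> str:
    """Label a play by comparing what was paid out with what was wagered."""
    if payout_cents == 0:
        return "loss"
    if payout_cents < wager_cents:
        # The machine pays something and may celebrate it,
        # but the player ends the play with less money.
        return "loss disguised as a win"
    if payout_cents == wager_cents:
        return "break-even"
    return "win"


if __name__ == "__main__":
    wager, payout = 200, 150  # the $2 wager and $1.50 payout from the text
    print(classify_play(wager, payout))
    print(net_result_cents(wager, payout))
```

For the example in the text, the play is labelled a loss disguised as a win and the net result is minus 50 cents, even though the machine presents the outcome with the cues of a win.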
A small number of previous studies of sound in slot machines have shown that sound influences a gambler’s impression or perception of the machine, including the quality of the machine (the fidelity of the sound is a primary reason for selecting one EGM over another),
helping to create a sense of familiarity, branding or distinguishing the machine, and creating the illusion of winning, since players may only hear winning sounds (Griffiths & Parke, 2005). Furthermore, Dibben (2001) argues that, for listeners, the reception of music and sounds are not only embedded in the material and physical dimensions of hearing but are also, and critically, grounded in social and cultural knowledge and awareness, based on “listeners’ needs and occupations” (p. 183). This idea—that response to music and sounds can be influenced by culture and personal experience—has self-evident relevance for a study focusing on the role of sound in relation to individuals immersed in gambling environments and/ or those at risk for addictive gambling behavior. We will first cover the environment in which the slot machines are commonly found and then focus on the machines themselves.
CASINO SOUND: ENVIRONMENTAL FACTORS
The sound of electronic gambling machines in the context of a casino can play a significant role in the perception of the games. Background music in the casinos or bars changes throughout the day, with pop music played in daytime, and relaxing music in the evenings (Dixon, Trigg, & Griffiths, 2007). The noise and music give the impression of an exciting and fun environment and, critically, that winning is more common than losing. In fact, Anderson and Brown (1984), in a comparison of response to gambling in a laboratory and a casino setting, found that in the casino, the player’s heart rate increases considerably. Moreover, increased exposure to the casino setting in problem gamblers leads to an increased arousal response. They note that “[t]he constant repetition of major changes in autonomic or other kinds of arousal associated in time and place with various forms of gambling activity is likely to have a powerful classical or
Pavlovian conditioning effect on gambling behavior” (p. 400). There has been considerable research into environmental sounds and their impact on consumer behavior in regard to advertising and retail. Servicescapes (that is, the soundscape and landscape of the service environment) have been one recent area of focus in advertising and marketing research. A pleasant ambience, it is felt, is key to a pleasurable shopping experience. Congruency in ambience between the brand, sounds, scent, and other aspects of the store is vital to a positive consumer experience (see Mattila & Wirtz, 2001). Companies like the now-defunct Muzak have, of course, built businesses on this idea. Alvin Collis, VP of strategy and brand for Muzak, outlines the concept of the servicescape: I walked into a store and understood: this is just like a movie. The company has built a set, and they’ve hired actors and given them costumes and taught them their lines, and every day they open their doors and say, ‘Let’s put on a show.’ It was retail theatre. And I realized then that Muzak’s business wasn’t really about selling music. It was about selling emotion, about finding the soundtrack that would make this store or that restaurant feel like something, rather than being just an intellectual proposition. (see Owen, 2006) Certainly, statistics seem to back up Muzak’s ideas, with some studies suggesting that young people spend 36% more time in a shop when music is being played, that if Muzak is played in a supermarket, it will increase the percentage of customers making a purchase there by 17%, and so on (KSK Productions, n.d.). Generally speaking, consumers spend longer in environments when there is some form of background music, as long as the music is low in volume and uncomplex (Garlin & Owen, 2006, p. 761). Music tempo changes can alter the length of time a shopper spends as well as the amount of money spent. Not only this, but music can also influence the perceived amount of time
spent. Young people under 25 perceived that they had spent longer in an “easy listening” store condition, while older shoppers perceived that they had spent longer in a Top 40 store condition. Familiar music led to the impression that they were shopping longer (Yalch & Spangenberg, 2000). Muzak’s website described its music concept (what it terms “audio architecture”) as follows: Its power lies in its subtlety. It bypasses the resistance of the mind and targets the receptiveness of the heart. When people are made to feel good in, say, a store, they feel good about that store. They like it. Remember it. Go back to it. Audio Architecture builds a bridge to loyalty. (Muzak Corporation, n.d.) Music is, of course, not the only element of environmental sound that plays into the overall ambience. Sound effects, such as in Discovery Channel’s stores with sound zones, or a Canadian supermarket close to one of the authors, Sobey’s, which has chirping birds and frogs in the produce aisle, can also create an overall atmosphere. Both sound effects and music can help to quickly identify a brand for consumers without prior experience of that brand. Music can cue the shopper as to the intended market, and a poor choice of music can clash with the values of the brand (Beverland, Lim, Morrison, & Terziovski, 2006). Griffiths and Parke (2005) draw on a theoretical model by Condry and Scheibe (1989) regarding persuasion in advertising and adopt this model for slot machine sound. They suggest that there are stages in the persuasion process that involve a person committing to the machine. This begins with exposure (they must be exposed to the machine and that might be in a bar) and leads to attention (in which sound plays a particularly important role to draw attention in a noisy atmosphere). From there, comprehension and yielding take place: a familiar musical theme helps draw the player in, believing the machine is socially acceptable because the sound is likable and familiar. Finally,
Sound in Electronic Gambling Machines
the retention and decision-to-gamble stages occur. In other words, sound is used to draw people in, make them feel comfortable, and convince them to play. The authors hypothesize that the background sounds and music might increase the confidence of the players, increase arousal, help to relax the player, help the player to disregard previous losses, and induce a romantic state leading them to believe that they may win. One study into the effect of background music on virtual roulette found that the speed of betting was influenced by the tempo of the music, with faster music leading to faster betting. Another study suggests that there are two main types of casino design: a playground design (spacious, with warm colors, vegetation, and moving water) and a low-ceilinged, crowded and compact design. This study found that music increased perceived at-risk gambling intentions in the playground casino design while decreasing those intentions in the other design. In the presence of just ambient sounds, however, this finding was reversed (Marmurek, Finlay, Kanetkar, & Londerville, 2007). What is certain is that the flashing lights, the room lighting, the carpeting and visual design of the space, the conflicting smells of food, perfume and alcohol, and in particular the use of loud sounds serve at once to create feelings of excitement and luxury and to distract the player by increasing cognitive load (the effort involved in processing multi-modal information and in using working memory) (see Hirsch, 1995; Kranes, 1995; Skea, 1995). 
Multiple conflicting stimuli and calls on attention, by increasing cognitive load, cause people to process information using guessing, stereotyping, and automatic responses to stimuli rather than reasoned and rational response and introspection.5 This depends, somewhat, on the type of music involved, as well as the personal perception of the individual involved (Carter, Wilson, Lawsom, & Bulik, 1995; McCraty, Barrios, Atkinson, & Tomasino, 1998; Wolfson & Case, 2000).
Some slot machines, however, employ noise cancellation technology to remove any “destructive interference” that may distract a player from the flow of gameplay, and so increase immersion (Schull, 2005, p. 67). An Australian study found conflicting views of background ambience, with some players getting distracted, and others reporting excitement: “You can go either way when you hear somebody else going, you can get all hyped up and think, gee their machine’s going I could also have it, or it could go the opposite, why isn’t my machine paying. It has a double affect” versus, “The minute I hear the ‘ching, chong China man’, I quickly run around to see”… Two participants noted that the music made them “anxious” and “desperate” as they believed that everyone else around them was winning something, when they were not” (Livingstone, Woolley, Zazryn, Bakacs, & Shami, 2008, p. 103). Computer games today are rarely consumed in an arcade environment whose music and sound can be manipulated, but the use of non-diegetic music in games as well as the use of ambience could be adjusted to take into consideration some of the results of these studies. For instance, altering the perception of time through the use of changing tempos or generating feelings of excitement with carefully timed sound effects in the ambient world may help to engage the player. There are also implications here relating to games that require further research. In particular, how does the fact that players can substitute their own music in Xbox 360 games influence their perception of gameplay? How does the use of familiar music impact the player’s perception of unfamiliar games? These questions are outside the scope of this chapter, but clearly have important consequences with regard to player engagement with and enjoyment of a game. Of course, more easily manipulated than the environmental space in which gameplay takes place is the use of sound in the games themselves.
SOUND IN EGMs

The earliest slot machines, such as the Mills Liberty Bell of 1907, included a bell that rang on a winning combination, a concept that is still present in most slots today. Playwright Noël Coward noted that sound was a key part of the experience in Las Vegas: “The sound is fascinating . . . the noise of the fruit machines, the clink of silver dollars, quarters, nickels” (in Ferrari & Ives, 2005). As in the nickelodeons of the same period, sound’s most important early role was its hailing function, attracting attention to the machines (Lastra, 2000, p. 98). Sound in EGMs has advanced alongside the technological changes introduced into the machines in the last few decades. EGMs now use computer-generated graphics, popular music, and high-fidelity sampled sound rather than relying on mechanical ball-bearings, bells or basic square-wave synthesizer chips. Today, sound effects in EGMs are used for a variety of feedback and reward systems. Up until about the early 1990s, slot machines featured about 15 “sound events”, whereas they now average about 400, and these are often carefully researched to manipulate the player (Rivlin, 2004, p. 4). Sound designer George Sanger explained that sound is created “by committee” and that the committee “always want it to be more exciting”, with little consideration for a dynamic range in the excitement portrayed (personal communication, October 15, 2009, Austin, TX). Such sound design includes effects of coins falling even though many slot machines neither accept nor pay out coins anymore. Notes Bill Hecht, an audio engineer for IGT, “We basically mixed several recordings of quarters falling on a metal tray and then fattened up the sound with the sound of falling dollars” (Rivlin, 2004, p. 3). Moreover, these false coin sounds can portray wins much larger than the actual win. Unpredictable sounds in particular help to capture and maintain our attention (Glass & Singer, 1972). 
There has even been a recent patent to randomize winning sound effects in order to
increase the perception that the sound is more real than it actually is and to reduce the recognition that it is merely careful programming at play. The patent explains:

In the conventional slot machine… the sound effects generated from the speaker are based on only one kind of sound effect pattern. For example, when a big bonus game occurs, a fanfare indicating the occurrence of the big bonus game is sounded, and so forth. Meanwhile, with a slot machine in which a special game has once occurred, the player typically keeps playing games while expecting special games to further occur. In this case, if the sound effects (winning sounds) identical to those at the first occurrence of the special game are generated upon the second or later occurrence, the pleasure of gaming may not fully be enjoyed. (Tsukahara, 2002, p. 1)

Slot machines use pseudo-random number generators carefully programmed to elicit the right reward schedule, however, and there is no real skill involved, only manipulations of perception. Recent research findings that music can increase success rate, for instance, are fallacious because, in a game with no skill component, this is simply not possible. Yamada (2009), for example, reports that:

Results indicated that the no-music condition showed the best rate of success. Moreover, a “mixed” musical excerpt added “unpleasantness” to the game and, in turn, resulted in a negative effect on the success rate. Increasing the speed increased the “potency” of the game, but did not affect the success rate, systematically. In the second experiment, we used the two excerpts performed in various registers and with various timbres as musical stimuli. (p. 1)

It is unclear if Yamada custom-designed the games that were tested or if the test was for illusion of success and perceptions of gameplay rather than actual success, and neither is it clear
if the game involved was a custom-built game for the purposes of the study (we could find no references to the game via Google), and so Yamada’s findings remain dubious at best. However, what Yamada’s work does show is that it is highly likely that music plays an important role in increasing the illusion of success. Of course, winning sounds are particularly important to the popularity and attraction of the machines, and losing sounds are rarely heard. When losing sounds are used in some machines, they are intentionally employed to antagonize the player, creating a short-term sense of frustration that, it has been suggested, prolongs the play period in what has been called “acoustic frustration”:

Antagonistic sounds invoke frustration and disappointment. For example, on The Simpsons fruit machine, Mr. Smithers smugly informs Homer Simpson that, “You’re fired”, or Chief Wiggam says, “You’re going away for a long time”. At present, we can only speculate about the consequences of such sound effects. In line with hypotheses supporting frustration theory and cognitive regret… this might make the fruit machine more inducing. (Parke & Griffiths, 2006, p. 171)

This idea of acoustic frustration could be adapted and utilized by computer games more effectively than is currently seen. For instance, commentary on gameplay (see below) is common in some types of games but absent in most computer games. Sound effects and music could also play a commentary role without using dialogue. The types of sounds used are particularly important to their affective power. Pulsating sounds that increase in pitch or speed (vibrato and tremolo) have been shown to help to increase tension, and verbal reinforcements (both negative and positive) are used to goad the player on with a sensation known as perceived urgency (see Edworthy, Loxley, & Dennis, 1991; Haas & Edworthy, 1996). The deeper a player gets into a
game, the louder and quicker the music usually becomes. High-pitched sounds—very common to slot machines—are also very useful in attracting our attention as they perceptually appear closer to us. Notes Millicent Cooley (1998): “Advertisers use this principle when they pump television commercials full of high frequency audio that makes characters sound as if they are intruding into viewers’ homes.” The types of sounds used in EGMs are also carefully chosen according to Western cultural likes and dislikes. As one study of pleasing sounds found, chimes are particularly highly rated: “Our highest rated sounds generally related to escapism (e.g., fantasy chimes, birds singing) and pleasure (children laughing)” (Effrat, Chan, Fogg, & Kong, 2004, p. 64). Large wins in slot machines are characterized by a “rolling sound”, with the length of the music cue tied to the size of the win. Winning sounds are often carefully constructed to be heard over the gameplay of other players, both to draw attention to the machine and to raise the self-esteem of the player, who then becomes the center of attention on the slot machine/casino/bar floor (Griffiths & Parke, 2005, p. 7). Often, this music consists of high-pitched, major-mode songs with lots of chimes and money sounds. Higher pitch tends to increase the perception of urgency, with perceived urgency rising as pitch rises, but it also helps to cut through the ambient noise of a busy casino (Haas & Edworthy, 1996). There are several implications here for computer game sound. First is the reinforcement role of both encouraging and antagonistic sound. Sonic rewards are under-utilized in games and the idea of a reward schedule, while it has been used in computer games, is likewise unusual. Tying the two together—having a system of sonic rewards at anticipated specific timings in the game—could help to keep a player interested for longer. 
Losing sounds, as discussed above, are perhaps the equivalent of player health decreases or death in a computer game. It is quite common for computer
games marketed to children to sonically represent the death of the player’s character as a not particularly negative event. This may even be silence upon the character’s or game’s end (the equivalent of not hearing a losing sound in a slot machine), a fun “raspberry”, a game show-like losing sound (as in Rocky and Bullwinkle on the Nintendo Entertainment System (NES)), or cheery “try again” music (as in the Jetsons or Flintstones game-over music for the NES). On the other hand, in more adult-oriented games, the player’s death can be a much more negative event with serious funeral dirges. It may be worthwhile for sound designers to explore the possibility of including more losing-type sounds in other places within the game, in order to increase the acoustic frustration the player feels, thus enhancing the impact of winning sounds and increasing emotional engagement. Psychological studies have shown that frustrative non-rewards are considerably motivating. In simple terms, “failing to fulfill a goal produces frustration which (according to the theory) strengthens ongoing behavior”, leading to cognitive regret and encouraging persistent play in the desire to relieve the regret (Griffiths, 1990; see also Amsel, 1962). King, Delfabbro, & Griffiths (2009) note:

Video games have also become longer and more complex, making a punishment like permanent character death an unappealing feature, particularly for a less committed, casual playing audience. Common forms of punishment in games include having to restart a level, failing an objective, or losing resources of some kind, like items, XP or points. (p. 10)

It is possible, therefore, to strengthen the sounds of losing tied to these lesser events, in order to tap into the acoustic frustration effect seen in slot machines. While we typically hear sounds tied to these events in current games, a stronger sense of loss (and thus, upon winning, reward) may improve player involvement.
Likewise, the concepts of near misses and losses disguised as wins are popular in slot machines but rarely—if ever—heard in computer game sound. One might imagine, for instance, a “mini-game” within a larger game in which the player is sonically teased with almost winning a bonus round or is given the impression that they have won more points than they actually have within that bonus round. This would probably, of course, only be useful for certain types of games aimed at certain types of players. One can imagine this effect in a Wii casual game designed for all ages, for instance, but less so in a big-budget first-person shooter title on the Xbox 360.
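The sound-variation idea described in the patent quoted earlier in this section (Tsukahara, 2002) can be illustrated with a short sketch. The Python below is only one possible reading, assuming a hypothetical pool of winning cues and a simple rule that never plays the same cue twice in succession. The class name, the cue names, and the selection rule are illustrative assumptions and are not taken from the patent or from any real EGM.

```python
import random


class WinCuePicker:
    """Pick a winning sound cue at random, never repeating the previous one.

    Hypothetical illustration only: the cue names and the no-repeat rule
    are assumptions, not a description of any real machine.
    """

    def __init__(self, cues, rng=None):
        # At least two distinct cues are needed to avoid a repeat.
        if len(set(cues)) < 2:
            raise ValueError("need at least two distinct cues")
        self._cues = list(cues)
        self._rng = rng or random.Random()
        self._last = None

    def next_cue(self):
        # Exclude the cue that was played last, then choose uniformly.
        candidates = [cue for cue in self._cues if cue != self._last]
        choice = self._rng.choice(candidates)
        self._last = choice
        return choice


if __name__ == "__main__":
    picker = WinCuePicker(["fanfare_a", "fanfare_b", "chimes"])
    for _ in range(5):
        print(picker.next_cue())
```

A game designer could apply the same rule to any repeated reward sound, so that a player who triggers several bonuses in a row does not hear an identical cue each time.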
Slots, Familiarity and Brands

Important to feelings of player comfort and emotional connection to the machine is the branding of EGMs using well-known intellectual property. Popular songs are often used to attract a player to the machine and to cause players to feel more comfortable and familiar with that machine. Similarly, sound can play a role in the branding of certain companies, which create distinctive winning sounds in an effort to have their sounds heard over the din of the casino. Indeed, branded EGMs are becoming both more commonplace and more popular in casino environments. Whereas once producers of popular culture sought to remain apart from the perceived negative connotations associated with the gambling industry, today films like Top Gun and Star Wars, television game shows like Jeopardy!, Deal or No Deal and The Price Is Right, and musical acts like Elvis Presley, the Village People and Kenny Rogers all have branded EGMs (Dretzka, 2004). Familiarity with a television show, film, person, place, musical act or sport can, for instance, entice players to the machine because it may “represent something that is special to the gambler… Players may find it more enjoyable because they can easily interact with the recognizable images and music they
experience” (Griffiths & Parke, 2005, p. 5). As Dretzka (2004) observed:

Seemingly overnight, casinos actually began sounding different. Instead of clanging bells, mechanical clicks and clacks, and jackpot alarms, the soundtrack was more of an electronic gurgle and hum, with bursts of ‘This is Jeopardy!’, ‘Wheel of Fortune!’ and snippets of rock songs. A generation of Americans raised in front of their television sets ate it up.

Moreover, familiarity with and repetition of musical themes have been shown to have a positive influence on our liking of the music (see Bradley, 1971). Verbal reinforcement with known characters (as well as, to a lesser degree, unknown characters) also takes place, as seen above, with familiar characters telling people that they are “cool” or “a genius”. Parke and Griffiths (2006) note that verbal reinforcement that increases play is designed to raise self-esteem, give hints and guidance, and even provide friendship or company (p. 171). An unexplored area of research is the relationship between verbal reinforcement and the anthropomorphizing of slot machines. As Langer (1975) describes such anthropomorphism: “Gamblers imbue artifacts such as dice, roulette wheels, and slot machines with character, calling out bets as though these random (or uncontrollable) generators have a memory or can be influenced” (see also Gaboury & Ladouceur, 1989; Toneatto, Blitz-Miller, Calderwood, Dragonetti, & Tsanos, 1997). It is very likely that sound plays a considerable role in the anthropomorphizing of slot machines—particularly in those cases where the machines “talk” to the player, but also in the mere fact that they are sonically responsive to our input. In reference to the game show computer game You Don’t Know Jack, Millicent Cooley (1998) notes that the player:

Will be aggressively challenged to prove that you know jack (anything at all), and you know this,
again, because of the dialog and swaggering, aggressive tone of the host. The machine is in charge and you, the player, are not; the game is quick-paced, there is a sense that you will be rushed along and should try to keep up and prove that you do, in fact, know jack. You feel this pressure because the voice of the host rushes you to sign in, taunting you impatiently at every step. (p. 8)

It is possible that a similar process is at work with slot machines—that is to say, the taunting will increase the speed with which the player plays, antagonizing the player to the point where the player loses focus on what truly matters (that is, the loss of their money). In reference to sonic branding, Jackson (2003) suggests that the voice heard links to the perceived personality (including perceived behavior and perceived appearance) of the speaker and, therefore, of the brand (p. 135), and it is equally likely that a similar effect is seen in the perceived personality of the machine. It has been said that 38% of the effect we have on other people can be attributed to our voice, with only 7% to the actual words we’ve spoken (the rest being body language) (Westermann, 2008, p. 153). In a study into voice and brand, UK telecom provider Orange identified a series of attributes that define the sound of a voice: rhythm (emphasis is placed on what is said); pitch (high versus low); melody (rhythm and pitch together); pace (speed); tone (overall musical quality); intonation (what is said relating to how it is said); energy; clarity; muscular tension; resonance; pause; breath; commitment; and volume (Westermann, 2008, p. 153). These attributes work together to shape our perceptions of what is being said. Particularly notable is the impact that the voice (and what it says) can have on our perceptions of what we are seeing and/or experiencing. Several studies have shown how the voice influences our perception of video sports performances. 
In a study of sports commentary, Bryant, Brown, Comisky, & Zillmann (1982) discovered that our enjoyment of watching sports
is largely tied to the dramatic embellishments provided by the commentary of the sportscasters. However, it is not only our enjoyment but also the ways in which we interpret what we are seeing that are influenced by commentary. In one study, it was found that commentary affected the perception of aggression of the players in an ice hockey match (Comisky, Bryant, & Zillmann, 1977). Not only this, but the more aggressive commentary was also perceived as more enjoyable. Similar studies have reached the same conclusions about commentary on a tennis match (Bryant, Brown, Comisky, & Zillmann, 1982), a soccer game (Beentjes, Van Oordt, & Van Der Voort, 2002) and a basketball game (Sullivan, 1992). This influence of commentary on perception is likely to play an equally important role in slot machines and in computer games, although this remains another area of game sound that is largely unexplored. Sports games in particular make use of commentary, although it is also very common to find commentary in games that imitate television game shows. It is possible, therefore, that the addition of a narrative of events in some games may affect the player’s perception of their gameplay, as well as their enjoyment of the game, although the technique is clearly under-utilized. Another trait discussed above that is highly popular in slot machines but less common in computer games is the use of familiarity and branding tied to the machines. Not only do the games themselves have distinctive sounds, but each company has its own overarching style and aesthetic that can be quickly learned upon spending time on the casino floor. The coin sounds from an IGT slot machine, for instance, sound different from those generated by a Bally machine. While this acoustic branding is particularly relevant in an environment where machines are competing for attention, creating a distinctive sound and branding for franchise or episodic games remains relevant in other environments also. 
Some computer games have, of course, employed this technique—the Super Mario Bros series, for
instance, has maintained a distinctive aesthetic through countless incarnations, platforms, and technological improvements. However, there are many games that still do not attempt to capitalize on this ability to entice experienced players to a new version of the game with the creation of a distinct, recognizable sound.6
RESPONSES TO EGM SOUND

The response of players to slot machine sounds is diverse, reflecting the different needs and desires of the players. For many, music and sound signify success, as one study has found: “I like it when it’s going long [the music], because you know you’re winning plenty of money. When they’re short, I don’t like them…” (Livingstone et al., 2008, p. 103). Other players—those who by their comments appear to be more regular gamblers—dislike the sounds, the study found: “sounds are too loud and attract attention. If someone lets the feature music go on and on they are not serious—the problem gamblers hate hearing it go on and on—and it draws attention to you” (Livingstone et al., 2008, p. 103). A few other participants also reported pressing “collect” straight after a win specifically to stop the music from playing. While some players found the sounds of others winning exciting, others felt that it gave them the impression that “everyone is winning but you” (Livingstone et al., 2008, p. 103). One study regarding sound’s presence (as on or off) showed that players strongly preferred sound to be on (Delfabbro, Falzon, & Ingram, 2005). Response to sound, therefore, can vary from player to player, but some typical responses can be summarized. Studies of the physiological response to sound (typically industrial noise, but also including music, speech, and other sounds) have found that sounds can contribute to increases in blood pressure and, most importantly, impair performance on a vigilance task (Smith & Morris,
1997). Wolfson and Case (2000) studied heart rate response to manipulation of the loudness of sound in a computer game, finding that louder sounds led to increased heart rate, and discussed the impact that physiological arousal has on our attention levels. They note:

People performing a task when minimally aroused are more likely to be slow, indifferent, and spread their attention across a wide range of stimuli. When highly aroused, people tend to be faster but less accurate, and they focus mainly on the most salient aspects of a task. Thus both high and low levels of arousal can have detrimental effects on performance. (Wolfson & Case, 2000, p. 185)

Physiological responses to stimuli can be tested using a variety of measures, including (but not limited to) electroencephalograms (EEGs), facial electromyography, heart rate, pupil dilation and electrodermal response. Galvanic skin response (GSR), one component of electrodermal response, also known as skin conductance response or sweat response, is an affordable and efficient measurement of simple changes in arousal levels—one of the reasons why it is the main component of a polygraph device. Essentially, GSR measures the electrical conductivity of the skin, which changes with psychological state. (See Nacke & Grimshaw, 2011 for the use of such measures when assessing psychophysiological responses to computer game sound.) Studies using GSR on subjects exposed to music date back to at least the 1940s (for example, Dreher, 1947; Traxel & Wrede, 1959), but their findings are highly contradictory owing to the differing conditions in which the studies took place. Sound and music have a known influence on listeners’ arousal and anxiety levels, but this depends on many factors including the degree of musical knowledge, the tempo of the music, the familiarity with the music, preference for the music, and recent exposure to that music. Smith and Morris (1976) found that stimulating music increased worry and anxiety, although they tested their student subjects during an examination. Rohner and Miller (1980) found that music had no influence on anxiety levels. Pitzen and Rauscher (1998) and Hirokawa (2004), on the other hand, more recently found that stimulating music increased energy and relaxation (increasing GSR but not heart rate). Although there are many studies about music in isolation and its physiological effect on listeners, there has been much less research on music’s impact on GSR that takes into consideration the interaction between sound and visual image (for example, Thayer & Levenson, 1983). Perceptual studies (non-physiologically based research) from the field of advertising suggest that image and sound, when used congruently (that is, for instance, when both have a similar message), tend to amplify each other (for instance, Bolivar, Cohen, & Fentress, 1994; Bullerjahn & Güldenring, 1994; Iwamiya, 1994). There have also been studies into the physiological effects of gambling, which have shown that pupils may dilate, heart rate may increase, and skin conductance levels may increase (raising the GSR). Collectively, these are known as arousal levels, and it is the arousal-inducing properties of slot machines that are affected by winning and losing, with increased arousal levels for wins (see, for example, Coventry & Constable, 1999; Coventry & Hudson, 2001; Sharpe, 2004). Additionally, a number of studies, for instance, research by Dickerson and Adcock (1987), have questioned whether there is a connection between physiological responses to gambling and wider psychological issues governing perceptions of such elements as gambling environment, luck, and mood. These studies suggest there is some evidence that both psychological and physiological responses to gambling behaviors are fuelled in part by a player’s illusion of control (for example, Alloy, Abramson, & Viscusi, 1981). 
In more recent research into computer games and the computer gaming environment, Hébert, Béland, & Dionne-Fournelle (2005) have discovered that, “for the first time…auditory input
contributes significantly to the stress response found during video game playing” (pp. 2371–2372). This research suggests that physiological responses to music in computer games may be linked in part to genre, noting generally that the more aggressive and rapid the music, the more elevated physiological stress levels become. A recent pilot study into the sounds and sights of losses disguised as wins was undertaken with 16 participants by the University of Waterloo’s Problem Gambling Research Group. Each participant played Lucky Larry’s Lobstermania for 45 minutes while their arousal levels were measured using GSR. Participants wore a GSR recording device on their fingers while they played, with the GSR output synchronized with two signal wires that recorded when the player pressed the play button and whether the play resulted in a win, a loss disguised as a win (where the payout is less than the spin wager) or a regular loss (that is, a loss without the reinforcing sounds of a win). As might be expected, the highest GSR rating—indicating the highest arousal level—was found with wins, and the lowest with regular losses. What is particularly interesting, however, is that losses disguised as wins were much closer physiologically to wins than to losses. In other words, hearing the sounds of winning, even though the player has lost money, is enough to trick the mind/body into believing that the player is winning (Dixon, Harrigan, Sandhu, Collins, & Fugelsang, forthcoming). In the case of losses disguised as wins, these games play on the idea of synchresis. Film theorist Michel Chion (1994) defines synchresis as “the forging of an immediate and necessary relationship between something one sees and something one hears,” combining the ideas of synchronism (simultaneous events) with synthesis (p. 5). 
Essentially, sound changes our perception of the image that we see and, despite there being an opposing relationship between sound and image, we view images as connected to sound when they are played concurrently, with the sound dominating
our response. With losses disguised as wins, the numbers displayed on the machine tell us that we are losing (in other words, we “won” 50 cents, but our total credits and cash have been reduced since the last play) but the sound tells us that we are winning. In a sense, the sound overrules our eyes and leads the emotional (and physiological) response to the event. This phenomenon illustrates the importance of sound to our overall perception of audio-visual media, and points to one under-utilized way that sound could be used in computer games. Far from merely reinforcing image, sound can have a much more complex relationship with what is occurring on screen. We might use a “winning cue” sound, for instance, in a battle scene to trick the player into thinking that the evil “big boss” enemy is dead, only to have the enemy return to life. Or we might use sound to trick the player into thinking that drinking a bottle of potion was a beneficial event, only to reveal later that it was not.
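The outcome categories used in the pilot study described above can be stated precisely in a few lines. The following Python sketch is illustrative only and is not the study's software: the function and type names are hypothetical, amounts are assumed to be in cents, and treating a payout equal to the wager as a win is an assumption, since the text defines only the loss disguised as a win (a payout smaller than the spin wager).

```python
from enum import Enum


class Outcome(Enum):
    WIN = "win"
    LOSS_DISGUISED_AS_WIN = "loss disguised as a win"
    LOSS = "regular loss"


def classify_spin(wager_cents: int, payout_cents: int) -> Outcome:
    """Classify one spin by comparing the payout with the wager."""
    if wager_cents <= 0:
        raise ValueError("wager must be positive")
    if payout_cents < 0:
        raise ValueError("payout cannot be negative")
    if payout_cents == 0:
        return Outcome.LOSS
    if payout_cents < wager_cents:
        # Something was paid out, but less than was staked.
        return Outcome.LOSS_DISGUISED_AS_WIN
    # Assumption: breaking even is grouped with wins.
    return Outcome.WIN


def plays_win_sound(outcome: Outcome) -> bool:
    """Regular losses are the only outcomes without reinforcing win sounds."""
    return outcome is not Outcome.LOSS


if __name__ == "__main__":
    # A 50-cent "win" on a one-dollar wager still sounds like a win.
    outcome = classify_spin(wager_cents=100, payout_cents=50)
    print(outcome.value, plays_win_sound(outcome))
```

The point of the sketch is that the win sound is keyed to whether anything was paid out, while the player's balance depends on the difference between payout and wager, which is where sound and image come apart.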
CONCLUSION

The intent of this article has been to explore a comparatively understudied area of computer game sound, namely the role of music and sound in electronic gambling machines (EGMs). We explored the structural components of EGMs and EGM games, tracing the development of technical advances that have led to progressively more enhanced audio interfaces over the past two decades. Central to this discussion is the interrelationship between EGM technology, sound and human behavioral psychology. Research has shown that standard EGM gameplay concepts such as the “near miss” and “losses disguised as wins”, coupled with enhanced sound prompts and triggers, can encourage both more rapid and longer gameplay. A second, related point in this study has been our consideration of EGM sound within the wider soundscape of a casino/bar/gaming environment.
An interesting area of research as yet unexplored is determining whether gambling behavior is affected when EGM sounds commingle with, and compete against, external sources of music, sounds, and noise. Further, it would be interesting to explore whether a correlation exists between the concurrent use of image and sound in EGMs. Specifically, it remains to be determined whether EGM sound and video, individually and together, amplify and/or reinforce the notion of a loss disguised as a win or, conversely, whether they instead work to distract and divert gamblers’ attention away from the machine and, by extension, from the act of gambling. Early research indicates that sound does, in fact, reinforce the idea of winning even when the player is losing. There have been no studies to explore the impact that a similar sonic process has in computer games, but this is an interesting area for future exploration. A particularly important concept that can be taken from slot machines is the idea of customization. Slot machines, as shown, cater to two basic markets: arousal/action seekers, and those who seek escape/dissociation. It may be suggested that computer games have a similar audience, although this simple way of dividing players is perhaps inadequate. What does remain, however, is the concept that players have different needs for gameplay. And while some players enjoy the sounds of slot machines and the casino environment, others clearly would prefer the ability to turn down, or turn off, sound altogether. Computer games, of course, have long recognized this, offering the ability to turn sound on or off and, later, to adjust the volumes of individual elements (ambience/sound effects/dialogue/music). More recently, the option for players to insert their own preferred music into a game has furthered the ability to customize game sound.
Further, some games have “boredom switches” that drop the volume levels automatically after a player has become “stuck” at a particular stage in the game. However, it might also be possible to adjust sound based on the player’s skill level and ability, with more frequent frustration sounds being used as the player advances, for example, and greater sonic encouragement at the start of a game. Different sounds may be used when the game is being played in one-player or in multi-player mode. Recently, with the creation of physiologically aware gaming devices such as the Wii Vitality Sensor, it has become possible to adjust sound in real time based on the player’s physiological response. We believe that this area of computer gaming, what we might call “player aware” games, will become an important future area for research. In particular, it is possible both to craft sound that manipulates the player’s physiological response and to have sound respond to that response. It might be possible, in other words, for games to “read” our emotional and physiological state and adjust music to keep us interested, to guide us to another state, or to enhance an existing state. Sound clearly plays an important role in the perception of gaming, and will continue to grow in importance as computer games search for ever more ways to keep players interested.
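The “boredom switch” mentioned above can be illustrated with a minimal sketch. It assumes that the game tracks how long the player has been stuck at one stage and fades the music linearly toward a quieter level after a threshold; the function name `boredom_volume` and all default values are hypothetical and are not taken from any particular game.

```python
def boredom_volume(base_volume: float,
                   seconds_stuck: float,
                   threshold: float = 120.0,
                   fade_seconds: float = 60.0,
                   floor: float = 0.2) -> float:
    """Return the music volume, fading toward `floor` once the player is stuck.

    base_volume   -- normal music volume (0.0 to 1.0)
    seconds_stuck -- time spent at the current stage without progress
    threshold     -- time before the fade begins (hypothetical default)
    fade_seconds  -- duration of the linear fade (hypothetical default)
    floor         -- fraction of base_volume kept at the end of the fade
    """
    if seconds_stuck <= threshold:
        return base_volume
    # Progress through the fade, clamped to the range 0.0 .. 1.0.
    progress = min((seconds_stuck - threshold) / fade_seconds, 1.0)
    return base_volume * (1.0 - progress * (1.0 - floor))
```

The same shape could, in principle, be driven by a physiological reading instead of elapsed time, which is the kind of “player aware” adjustment suggested above; that substitution is left open here because the text does not specify how such readings would be mapped to sound.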
REFERENCES
Alloy, L., Abramson, L., & Viscusi, D. (1981). Induced mood and the illusion of control. Journal of Personality and Social Psychology, 41, 1129–1140. doi:10.1037/0022-3514.41.6.1129
Amsel, A. (1962). Frustrative nonreward in partial reinforcement and discrimination learning: Some recent history and a theoretical extension. Psychological Review, 69(4), 306–328. doi:10.1037/h0046200
Anderson, G., & Brown, R. I. T. (1984). Real and laboratory gambling, sensation-seeking and arousal. The British Journal of Psychology, 75(3), 401–410.
Beentjes, J. W. J., Van Oordt, M., & Van Der Voort, T. H. A. (2002). How television commentary affects children’s judgments on soccer fouls. Communication Research, 29, 31–45. doi:10.1177/0093650202029001002
Beverland, M., Lim, E. A. C., Morrison, M., & Terziovski, M. (2006). In-store music and consumer–brand relationships: Relational transformation following experiences of (mis)fit. Journal of Business Research, 59, 982–989. doi:10.1016/j.jbusres.2006.07.001
Bolivar, V. J., Cohen, A. J., & Fentress, J. C. (1994). Semantic and formal congruency in music and motion pictures: Effects on the interpretation of visual action. Psychomusicology, 13, 28–59.
Bradley, I. L. (1971). Repetition as a factor in the development of musical preferences. Journal of Research in Music Education, 19(3), 295–298. doi:10.2307/3343764
Brown, R. I. F. (1986). Arousal and sensation-seeking components in the general explanation of gambling and gambling addictions. Substance Use & Misuse, 21(9), 1001–1016. doi:10.3109/10826088609077251
Bryant, J., Brown, D., Comisky, P. W., & Zillmann, D. (1982). Sports and spectators: Commentary and appreciation. The Journal of Communication, 32(1), 109–119. doi:10.1111/j.1460-2466.1982.tb00482.x
Bryant, J., Comisky, P., & Zillmann, D. (1977). Drama in sports commentary. The Journal of Communication, 27(3), 140–149. doi:10.1111/j.1460-2466.1977.tb02140.x
Bullerjahn, C., & Güldenring, M. (1994). An empirical investigation of effects of film music using qualitative content analysis. Psychomusicology, 13, 99–118.
Carter, F. A., Wilson, J. S., Lawson, R. H., & Bulik, C. M. (1995). Mood induction procedure: Importance of individualising music. Behaviour Change, 12, 159–161.
Chion, M. (1994). Audio-vision: Sound on screen. New York: Columbia University Press.
Comisky, P. W., Bryant, J., & Zillmann, D. (1977). Commentary as a substitute for action. The Journal of Communication, 27(3), 150–153. doi:10.1111/j.1460-2466.1977.tb02141.x
Condry, J., & Scheibe, C. (1989). Non program content of television: Mechanisms of persuasion. In Condry, J. (Ed.), The Psychology of Television (pp. 217–219). London: Erlbaum.
Cooley, M. (1998, November). Sound + image in computer-based design: Learning from sound in the arts. Paper presented at International Community for Auditory Display Conference, Glasgow, UK.
Coventry, K. R., & Constable, B. (1999). Physiological arousal and sensation seeking in female fruit machine players. Addiction (Abingdon, England), 94, 425–430. doi:10.1046/j.1360-0443.1999.94342512.x
Coventry, K. R., & Hudson, J. (2001). Gender differences, physiological arousal and the role of winning in fruit machine gamblers. Addiction (Abingdon, England), 96, 871–879. doi:10.1046/j.1360-0443.2001.9668718.x
Crockford, D., Goodyear, B., Edwards, J., Quickfall, J., & el-Guebaly, N. (2005). Cue-induced brain activity in pathological gamblers. Biological Psychiatry, 58(10), 787–795. doi:10.1016/j.biopsych.2005.04.037
Csíkszentmihályi, M. (1990). Flow: The psychology of optimal experience. New York: HarperPerennial.
Delfabbro, P., Fazlon, K., & Ingram, T. (2005). The effects of parameter variations in electronic gambling simulations: Results of a laboratory-based pilot investigation. Gambling Research: Journal of the National Association for Gambling Studies, 17(1), 7–25.
Dibben, N. (2001). What do we hear, when we hear music? Music perception and musical material. Musicae Scientiae, 2, 161–194.
Dickerson, M., & Adcock, S. (1987). Mood, arousal and cognitions in persistent gambling: Preliminary investigation of a theoretical model. Journal of Gambling Behaviour, 3(1), 3–15. doi:10.1007/BF01087473
Dixon, L., Trigg, R., & Griffiths, M. (2007). An empirical investigation of music and gambling behaviour. International Gambling Studies, 7(3), 315–326. doi:10.1080/14459790701601471
Dixon, M., Harrigan, K. A., Sandhu, R., Collins, K., & Fugelsang, J. (2011, in press). Slot machine play: Psychophysical responses to wins, losses, and losses disguised as wins. Addiction.
Dreher, R. E. (1947). The relationship between verbal reports and the galvanic skin response. Journal of Abnormal and Social Psychology, 44, 87–94.
Dretzka, G. (2004, December 12). Casinos, celebrities bet on our love for pop culture icons. Seattle Times. Retrieved July 15, 2009, from http://community.seattletimes.nwsource.com/archive/?date=20041212&slug=casinos12
Edworthy, J., Loxley, S., & Dennis, I. (1991). Improving auditory warning design: Relationship between warning sound parameters and perceived urgency. Human Factors, 33, 205–231.
Effrat, J., Chan, L., Fogg, B. J., & Kong, L. (2004). What sounds do people love and hate? Interaction, 11(5), 64–66. doi:10.1145/1015530.1015562
Ferrari, M., & Ives, S. (2005). Slots: Las Vegas gamblers lose some $5 billion a year at the slot machines alone. Las Vegas: An unconventional history. New York: Bulfinch.
Gaboury, A., & Ladouceur, R. (1989). Erroneous perceptions and gambling. Journal of Social Behavior and Personality, 4(41), 111–120.
Garlin, F. V., & Owen, K. (2006). Setting the tone with the tune: A meta-analytic review of the effects of background music in retail settings. Journal of Business Research, 59, 755–764. doi:10.1016/j.jbusres.2006.01.013
Glass, D. C., & Singer, J. E. (1972). Urban stress. New York: Academic.
Griffiths, M., & Parke, J. (2005). The psychology of music in gambling environments: An observational research note. Journal of Gambling Issues, 13. Retrieved July 15, 2009, from http://www.camh.net/egambling/issue13/jgi_13_griffiths_2.html
Griffiths, M. D. (1990). The cognitive psychology of gambling. Journal of Gambling Studies, 6(1), 31–42. doi:10.1007/BF01015747
Haas, E. C., & Edworthy, J. (1996). Designing urgency into auditory warnings using pitch, speed and loudness. Computing and Control Engineering Journal, 7, 193–198. doi:10.1049/cce:19960407
Harrigan, K. A. (2009). Slot machines: Pursuing responsible gaming practices for virtual reels and near misses. International Journal of Mental Health and Addiction, 7(1), 68–83. doi:10.1007/s11469-007-9139-8
Harrigan, K. A., & Dixon, M. (2009). PAR sheets, probabilities, and slot machine play: Implications for problem and non-problem gambling. Journal of Gambling Issues, 23, 81–110. doi:10.4309/jgi.2009.23.5
Hébert, S., Béland, R., & Dionne-Fournelle, O. (2005). Physiological stress response to videogame playing: The contribution of built-in music. Life Sciences, 76, 2371–2380. doi:10.1016/j.lfs.2004.11.011
Hirokawa, E. (2004). Effects of music, listening, and relaxation instructions on arousal changes and the working memory task in older adults. Journal of Music Therapy, 41(2), 107–127.
Hirsch, A. R. (1995). Effects of ambient odors on slot-machine usage in a Las Vegas casino. Psychology and Marketing, 12(7), 585–594. doi:10.1002/mar.4220120703
Hopson, J. (2001). Behavioral game design. Gamasutra. Retrieved October 23, 2009, from http://www.gamasutra.com/view/feature/3085/behavioral_game_design.php
Iwamiya, S. (1994). Interaction between auditory and visual processing when listening to music in an audio visual context. Psychomusicology, 13, 133–154.
Jackson, D. (2003). Sonic branding: An introduction. New York: Palgrave/Macmillan. doi:10.1057/9780230503267
King, D., Delfabbro, P., & Griffiths, M. (2009). Video game structural characteristics: A new psychological taxonomy. International Journal of Mental Health and Addiction, 8(1), 90–106. doi:10.1007/s11469-009-9206-4
Kranes, D. (1995). Play grounds. Gambling: Philosophy and policy [Special Issue]. Journal of Gambling Studies, 11(1), 91–102. doi:10.1007/BF02283207
Ladouceur, R., & Sévigny, S. (2005). Structural characteristics of video lotteries: Effects of a stopping device on illusion of control and gambling persistence. Journal of Gambling Studies, 21(2), 117–131. doi:10.1007/s10899-005-3028-5
Langer, E. J. (1975). The illusion of control. Journal of Personality and Social Psychology, 32, 311–328. doi:10.1037/0022-3514.32.2.311
Lastra, J. (2000). Sound technology and the American cinema: Perception, representation, modernity. New York: Columbia University Press.
Livingstone, C., Woolley, R., Zazryn, T., Bakacs, L., & Shami, R. (2008). The relevance and role of gaming machine games and game features on the play of problem gamblers. Adelaide: Independent Gambling Authority of South Australia.
Lucas, G. (Director). (1977). Star Wars [Motion picture]. Los Angeles, CA: 20th Century Fox.
Lucky Larry’s Lobstermania [Computer game]. (2002). Reno, NV: IGT.
Marmurek, H. H. C., Finlay, K., Kanetkar, V., & Londerville, J. (2007). The influence of music on estimates of at-risk gambling intentions: An analysis by casino design. International Gambling Studies, 7(1), 113–122. doi:10.1080/14459790601158002
Mattila, A. S., & Wirtz, J. (2001). Congruency of scent and music as a driver of in-store evaluations and behavior. Journal of Retailing, 77, 273–289. doi:10.1016/S0022-4359(01)00042-2
McCraty, R., Barrios-Choplin, B., Atkinson, M., & Tomasino, D. (1998). The effects of different types of music on mood, tension and mental clarity. Alternative Therapies in Health and Medicine, 4, 75–84.
Muzak Corporation. (n.d.). Why Muzak. Retrieved October 5, 2009, from http://music.muzak.com/why_muzak
Nacke, L., & Grimshaw, M. (2011). Player-game interaction through affective sound. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Oguro, C. (2009). The greatest Easter eggs in gaming. Gamespot. Retrieved October 5, 2009, from http://www.gamespot.com/features/6131572/index.html
Owen, D. (2006, April 10). The soundtrack of your life: Muzak in the realm of retail theatre. The New Yorker. Retrieved October 5, 2009, from http://www.newyorker.com/archive/2006/04/10/060410fa_fact
Parke, J., & Griffiths, M. (2006). The psychology of the fruit machine: The role of structural characteristics (Revisited). International Journal of Mental Health and Addiction, 4, 151–179. doi:10.1007/s11469-006-9014-z
Pitzen, L. J., & Rauscher, F. H. (1998, May). Choosing music, not style of music, reduces stress and improves task performance. Poster presented at the American Psychological Society, Washington, DC.
Productions, K. S. K. (n.d.). Cinematic & Muzak. Retrieved October 20, 2009, from http://www.kskproductions.nl/en/services/cinematic-a-muzak
Rivlin, G. (2004, May 9). The tug of the newfangled slot machines. New York Times. Retrieved July 15, 2009, from http://www.nytimes.com/2004/05/09/magazine/09SLOTS.html
Rohner, S. J., & Miller, R. (1980). Degrees of familiar and affective music and their effects on state anxiety. Journal of Music Therapy, 17, 2–15.
Schull, N. D. (2005). Digital gambling: The coincidence of desire and design. The Annals of the American Academy of Political and Social Science, 597, 65–81. doi:10.1177/0002716204270435
Scott, T. (Director). (1986). Top Gun [Motion picture]. Hollywood, CA: Paramount Pictures.
Seeking Alpha. (2008, August 5). The video game industry: An $18 billion entertainment juggernaut. http://seekingalpha.com/article/89124-the-video-game-industry-an-18-billion-entertainment-juggernaut
Sharpe, L. (2004). Patterns of autonomic arousal in imaginal situations of winning and losing in problem gambling. Journal of Gambling Studies, 20, 95–104. doi:10.1023/B:JOGS.0000016706.96540.43
Skea, W. H. (1995). “Postmodern” Las Vegas and its effects on gambling. Journal of Gambling Studies, 11(2), 231–235. doi:10.1007/BF02107117
Smith, C. A., & Morris, L. W. (1976). Effects of stimulative and sedative music on cognitive and emotional components of anxiety. Psychological Reports, 38, 1187–1193.
Sullivan, D. B. (1992). Commentary and viewer perception of player hostility: Adding punch to televised sports. Journal of Broadcasting & Electronic Media, 35, 487–504.
Super Mario Bros [Computer game]. (1995). Redmond, WA: Nintendo.
Surman, D. (2007). Pleasure, spectacle and reward in Capcom’s Street Fighter series. In Krzywinska, T., & Atkins, B. (Eds.), Videogame, player, text (pp. 204–221). London: Wallflower.
7th guest [Computer game]. (1993). Trilobyte (Developer). London: Virgin Games.
Thayer, J. F., & Levenson, R. W. (1983). Effects of music on psychophysiological responses to a stressful film. Psychomusicology, 3(1), 44–52.
The adventures of Rocky and Bullwinkle [Computer game]. (1992). Radical Entertainment (Developer). Agoura Hills, CA: THQ.
The Flintstones. (1991). The rescue of Dino & Hoppy [Computer game]. Vancouver, Canada: Taito Corporation.
The Jetsons. (1992). Cogswell’s caper! [Computer game]. Vancouver, Canada: Taito Corporation.
Toneatto, T., Blitz-Miller, T., Calderwood, K., Dragonetti, R., & Tsanos, A. (1997). Cognitive distortions in heavy gambling. Journal of Gambling Studies, 13, 253–261. doi:10.1023/A:1024983300428
Too human [Computer game]. (2008). Silicon Knights (Developer). United States: Microsoft Game Studios.
Traxel, W., & Wrede, G. (1959). Changes in physiological skin responses as affected by musical selection. Journal of Experimental Psychology, 16, 57–61.
Tsukahara, N. (2002). Game machine with random sound effects. U.S. Patent No. 6,416,411 B1. Washington, DC: U.S. Patent and Trademark Office.
Turner, N., & Horbay, R. (2004). How do slot machines and other electronic gambling machines actually work? Journal of Gambling Issues, 11.
Westermann, C. F. (2008). Sound branding and corporate voice: Strategic brand management using sound. Usability of speech dialog systems: Listening to the target audience. Berlin: Springer-Verlag.
Wing commander [Computer game]. (1990). Austin, TX: Origin Systems.
Wolfson, S., & Case, G. (2000). The effects of sound and colour on responses to a computer game. Interacting with Computers, 13, 183–192. doi:10.1016/S0953-5438(00)00037-0
Yalch, R. F., & Spangenberg, E. R. (2000). The effects of music in a retail setting on real and perceived shopping times. Journal of Business Research, 49, 139–147. doi:10.1016/S0148-2963(99)00003-X
Yamada, M. (2009, September). Can music change the success rate in a slot-machine game? Paper presented at the Western Pacific Acoustics Conference, Beijing, China.
You don’t know jack [Computer game]. (1995). Berkeley Systems/Jellyvision (Developer). Fresno, CA: Sierra On-Line.
KEY TERMS AND DEFINITIONS
Acoustic Frustration: The use of sound to antagonize a player, creating a short-term sense of frustration that, it has been suggested, prolongs the play period.
Electronic Gambling Machines: EGMs, also known as slot machines, video slots, or video fruit machines, are digital, electronic slot machines. They tend to be much faster than electric or mechanical slots, with an increased number of play options and bonuses.
Galvanic Skin Response: GSR, one component of electrodermal response and also known as skin conductance response or sweat response, is an affordable and efficient measurement of simple changes in arousal levels, which is one of the reasons why it is the main component of a polygraph device. Essentially, GSR measures the electrical conductivity of the skin, which changes with psychological state.
Losses Disguised as Wins: A play in which the player “wins” but receives a payout smaller than the amount wagered, hence actually losing on the wager despite being convinced (sonically) that they have, in fact, won.
Near Miss: A failure that was close to a win, such as two matching icons arriving on the pay-line followed by a third reel whose icon sits just off the pay-line. Slot machine manufacturers use this concept to create a statistically unrealistically high number of near misses (Harrigan, 2009), which convinces the player that they are close to winning, and therefore leads to significantly longer playing times (Parke & Griffiths, 2006).
Reward Schedule: A schedule of pay-offs or rewards tied to timings or game actions, resulting in a series of emotional peaks and valleys that keep a player interested in a game.
Rolling Sound: The music or sound effects that are played when a player wins a round on a slot machine. The length of the sound (its roll) is tied to the amount of the win, with larger wins rolling for longer times.
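The relationship described under Rolling Sound, in which the length of the roll follows the size of the win, can be sketched as a simple mapping. This is only an illustration under stated assumptions: a linear relationship between credits won and seconds of sound, clamped between a minimum and a maximum. The function name `rolling_sound_seconds` and every constant are hypothetical; actual machines may use quite different curves.

```python
def rolling_sound_seconds(win_credits: int,
                          seconds_per_credit: float = 0.05,
                          min_seconds: float = 0.5,
                          max_seconds: float = 10.0) -> float:
    """Return how long the rolling sound plays for a given win.

    No sound is played when nothing is won; otherwise the duration
    grows linearly with the win and is clamped to a sensible range.
    All defaults are illustrative assumptions.
    """
    if win_credits <= 0:
        return 0.0
    duration = win_credits * seconds_per_credit
    return max(min_seconds, min(duration, max_seconds))
```

Under these assumed defaults a very small win still receives the minimum roll, which is one way a payout smaller than the wager could nonetheless be announced as a win.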
ENDNOTES
1 It is a common practice for many avid slot machine gamers to play multiple, adjacent machines simultaneously. Further, activities like drinking, smoking and interaction with other gamblers and passersby may also take gamers’ attention away from the machines.
2 For instance, a reward schedule is built into Too Human. Personal conversation, Denis Dyack of Silicon Knights, St. Catherines, Ontario, 2008. See Hopson, 2001.
3 There are different versions of the game available, including a “progressive slot” with varying jackpots, a 25-line slot with a max bet of 1,250 credits and a payout of 500,000 credits.
4 Thanks to the anonymous reviewer of the chapter for this idea.
5 Commission on Behavioral and Social Sciences and Education Committee on the Social and Economic Impact of Pathological Gambling. (1999). Committee on Law and Justice. Commission on Behavioral and
Chapter 2
Sound for Fantasy and Freedom
Mats Liljedahl
Interactive Institute, Sonic Studio, Sweden
ABSTRACT
Sound is an integral part of our everyday lives. Sound tells us about physical events in the environment, and we use our voices to share ideas and emotions through sound. When navigating the world on a day-to-day basis, most of us use a balanced mix of stimuli from our eyes, ears and other senses to get along. We do this naturally and without effort. In the design of computer game experiences, however, most attention has traditionally been given to vision rather than to this balanced mix of stimuli. The risk is that this emphasis neglects types of interaction with the game that are needed to create an immersive experience. This chapter summarizes the relationship between sound properties, GameFlow and immersive experience, and discusses two projects in which Interactive Institute, Sonic Studio has balanced perceptual stimuli and game mechanics to inspire and create new game concepts that liberate users and their imagination.
DOI: 10.4018/978-1-61692-828-5.ch002
Copyright © 2011, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

INTRODUCTION
At the Interactive Institute, Sonic Studio in Piteå, Sweden, we do research on sound and auditory perception in order to find new ways to use sound, new contexts where sound can be utilized, and new applications for sound in general. Of special interest to us is how sound resembles and differs from other sensory stimuli and how this can be put to play. In our work we use perspectives and methods from art, science, and technology and we utilize digital technology as a vehicle for our ideas and experiments. In a series of projects we have explored intuitive, emotional, imaginative, and liberating properties of sound. These projects have resulted in new insights and knowledge as well as in new and innovative applications for sound, audio, and technology. In this chapter I will describe our perspective on a number of sound properties and how we have put these to work in various ways. The projects are based on and inspired by an ecological and everyday-listening approach to sound, like the ones proposed by R. Murray Schafer, William Gaver, and their followers.
As human beings, we are good at interpreting the soundscape constantly surrounding us. When we hear a sound we can make relatively accurate judgments about, for example, the objects involved in generating the sound, their weight, the materials they are made of, the type of event or series of events that caused the sound, the distance and direction to the sound source, and the environment surrounding the sound source and the listener. Much of the existing research on sound and auditory perception is about how to convey clear and unambiguous information through sound. In computer games, however, the aim is also to create other effects, effects that have as much to do with emotions, the subconscious, intuition, and immersion as they do with clear and unambiguous messages. This chapter describes a couple of projects in which we have worked with the balance between eye and ear, between ambiguity and unambiguity, between cognition and intuition, and between body and mind. The aim has been to create experiences built on a multitude of human abilities and affordances, mediated by new media technology.
In a traditional computer game setting, the TV screen or computer monitor is the center of attention. The screen depicts the virtual gameworld and the player uses some kind of input device, such as a game pad, a mouse, a keyboard, or a Wiimote, to remotely control the virtual gameworld and the objects and creatures in it. The action takes place in the virtual world and the player is naturally detached from the game action by the gap between the player’s physical world and the virtual world of the game. Much work has to be done and complex technology used in order to bridge that gap and to have the player experience a sense of presence in the virtual gameworld.
The aim is to make the player feel as immersed as possible in the game experience and to make her suspend her natural disbelief. To achieve this, the computer game industry must build broader and broader bridges over the reality gap to make the virtual game reality more immersive. The traditional way to increase immersion and suspension of disbelief has primarily been to increase graphics capability, and today we can enjoy near photo-realistic 3D graphics in real time. But there might be alternative ways to tackle the problem. Potentially, computer games could be more engaging and immersive without having to build long and broad bridges over the reality gap. What about narrowing the gap instead of building broader bridges over it?
BACKGROUND
Sound and light work in different ways and reach us on complementary channels. Our corresponding input devices, the visual and auditory perceptions, show both similarities and differences and we have an innate ability to experience the world around us by combining the visual, auditory, touch and olfactory perceptions into one, multimodal whole. We are built for and used to handling the world through a balanced mix of perceptual input from many senses simultaneously. This can be exemplified in different ways. One is by crossmodal illusions, for example, the McGurk effect (Avanzini, 2008, p. 366), which shows how our auditory perception is influenced by what we see. Another example is the ventriloquist illusion, in which the perceived location of a sound shifts depending on what we see (O’Callaghan, 2009, section 4.3.1). If the signal on one sensory channel is weak, we more or less automatically fill in the gaps with information from other channels and, in this way, we are able to interpret the sum of sensory input and make something meaningful of that sum. Watching lip movements in order to hear what your friend is saying at a noisy party is just one everyday example of this phenomenon. A third example is Stoffregen and Bardy’s concept of “global array” (Avanzini, 2008, p. 350).
According to this concept, observers are not separately sensitive to structures in the optic and acoustic flows, but are rather sensitive to patterns that extend across these flows: the global array. Another way to describe this is that we do not “see and hear” but rather “see-hear”: what we perceive is the sum of sensations reaching our different modalities. What we really hear, what a sound is, where a sound is located and so forth are questions that philosophers have been arguing over for several hundred years. O’Callaghan (2009) gives a broad summary of the history and current state of the field. What most philosophers and sound researchers agree on is that sounds are the result of events in the physical world. Sound holds information about these events and the objects involved in them. This means that, to our perception, sounds are strongly linked to the physical world and we are “hard wired” to treat sounds as tokens of physical activity, matter in motion and matter in interaction. In this context the pioneering work of William Gaver (1993) on sound classification and listening modes is still often cited and remains relevant for game sound design. Gaver makes a distinction between musical listening and everyday listening. In musical listening, you listen to the acoustic properties of the sound, for example, its pitch, loudness, and timbre. In everyday listening, on the other hand, you listen to events rather than sounds. When you hear a car passing by or a bottle breaking, you do not pay much attention to pitch or loudness but more to the event as such. In everyday listening, the interpretation and the mapping of sounds to the individual’s previous experiences and memories are crucial. When a bottle crashes against the floor, loses its original shape and turns into a number of smaller and larger pieces, this is immediately obvious to the eye.
But, in order to be able to pinpoint the event that caused the sound of the broken bottle, the ear has to learn and form a memory that connects the sound of broken glass to the event of a bottle crashing and losing its shape. Even if the individual has a previous experience and memory that connects the sound of broken glass to the event of a broken bottle, the ear, not knowing the exact cause of the sound, might hesitate. Was it a bottle that crashed or was it perhaps a large drinking glass that broke? The eye can give the correct answer, whereas the ear is left to interpret and to guess in various degrees. Tuuri, Mustonen, and Pirhonen (2007) have continued along this path and propose a hierarchical scheme of listening modes. Two of these are preconscious, two are source-oriented, three are context-oriented and one is quality-oriented. In the two preconscious listening modes, the focus is on what reflexive, emotive and associative responses a sound evokes in the listener. In the two source-oriented modes, the focus is on how the listener perceives the source of a sound and what event caused it. In the three context-oriented modes, the focus is on whether the sound had a specific purpose, whether it represents any symbolic or conventional meaning, and whether the sound in that case was suitable and understandable in the context. In the last, quality-oriented listening mode, the focus is on the acoustic properties of the sound, its pitch, loudness, duration and so forth. Using these or other complementary listening modes is a powerful way to inform sound design, not only for computer games but in general. The important thing to notice here is that research on listening modes in general shows that sound can indeed be used to evoke emotions and associations, to communicate properties of physical objects and events, and to convey meaning and purpose. Even before we are born, our auditory perception starts giving us information about the world around us (Lecanuet, 1996). From day one we start building our library of associations to individual sounds and to whole soundscapes.
Gradually, we learn what they mean and we train our ability to interpret them. Furthermore, some researchers argue that we experience sounds “as of” a bigger whole. O’Callaghan (2009) argues that the sound of hooves of a galloping horse is not identical with the galloping. Instead, it is part of the particular event of galloping: “Auditory perceptual awareness as of the whole [sic] occurs in virtue of experiencing the part” (part 2.3.2). This strong linkage between the sounds we hear and the physical world we inhabit can be brought into play in computer games through rich soundscapes in order to convey information about objects, environments, and events in the gameworld. Try this simple experiment. Pick an environment with a reasonable number of activities, people, birds, machines or whatever you can find that makes everyday sounds. Close your eyes and try not to interpret, make associations or create mental pictures from what you hear. It is very hard to put the auditory interpreter to rest and this is true for sounds from all types of sources, including the headphones playing sounds from your iPod. This interpretation, mapping, or disambiguation of individual sounds and whole soundscapes involves high-level mental processes related to our conscious and subconscious, cognitive and emotional layers. As such, these processes have the potential to invoke a myriad of physical and mental responses: fear, flight, well-being, happiness, anger, understanding and so on. In computer game design, this means huge potential both to convey cognitive meaning and to create moods and affect. Auditory perception can be understood as becoming aware of the whole by virtue of the parts. Sounds can also be said to be more ambiguous and to leave wider space for interpretation than visual stimuli do, at least when it comes to interpreting where and what we have heard. In Human-Computer Interaction (HCI) contexts, ambiguity has often been thought of as a disadvantage and a problem (Sengers & Gaver, 2006) and much, perhaps even a majority, of the research done in the field has tried to overcome this and find ways to create clear and unambiguous systems and interfaces.
Research on sound interaction design is no exception to this, as described by, for example, Gaver (1997). This is true also when it comes to sound in computer games but, in this context, the need to interpret and disambiguate the computer game system is not the only aspect of the issue. On the contrary, some authors argue that ambiguity and the need to interpret a system can instead be used as an asset (Sengers & Gaver, 2006; Sengers, Boehner, Mateas, & Gay, 2007). Here, we argue that this is certainly the case. When the ideas of ambiguity and interpretation are combined with the concepts of flow and GameFlow described below, the sum can be used to inform the game design process in new ways.
Development of computer games has so far mostly been geared towards vision. When it comes to sound in games, much of the published work consists of inspiring case studies rather than research. Sweetser and Wyeth (2005) list three aspects of usability in games that have previously been in focus for research. These are interface (controls and display), mechanics (interacting with the gameworld), and gameplay (problems and challenges). Lately, other dimensions of the design and use of computer games have also started to gain interest among game researchers, dimensions that incorporate new and more complex aspects and ideas of player enjoyment and computer game design. Several research groups have, for example, made connections between interactivity in general and, more specifically, player enjoyment in games on the one hand and the concept of flow developed by Mihaly Csíkszentmihályi on the other.
In the 1970s and 1980s Csíkszentmihályi conducted extensive research into what makes experiences enjoyable. He found that optimal experiences are the same all over the world and can be described in the same terms regardless of who is enjoying the experience. He called these optimal experiences flow.
A flow experience is defined by Csíkszentmihályi (1990) as being “so gratifying that people are willing to do it for its own sake, with little concern for what they will get out of it, even when it is difficult, or dangerous” (p. 71).
Judging from the volume and type of work built on and derived from Csíkszentmihályi’s flow principle, it can be argued that the concept is relevant in the context of computer games. Andrew Polaine (2005) has written about The Flow Principle in Interactivity. This work does not relate to computer games per se, but is closely related to the subject in that it connects flow with both “willing suspension of disbelief” (a term borrowed from narratives in theater and film) and the experience of play. The GameFlow model developed by Sweetser and Wyeth builds directly on the concept of flow and is a model for evaluating computer games from an enjoyment perspective. Another example is Kalle Jegers’ (2009) “Pervasive GameFlow” model that takes Sweetser’s and Wyeth’s GameFlow concept to the pervasive game arena. A final example is Cowley, Charles, Black, and Hickey’s (2008) USE model (User, System, Experience) that looks at games, player interaction, and flow from an information system perspective. Built on the Flow concept, Sweetser and Wyeth’s GameFlow model consists of eight elements for achieving enjoyment in games. The model can be used both when designing new games and when evaluating existing game concepts. In summary, according to Sweetser and Wyeth, games must keep the player concentrated through a high workload. At the same time, the game tasks must be sufficiently challenging and match the skill level of the player. The game tasks must have clear goals and the player must be given clear feedback on progression towards these goals. Enabling deep yet effortless involvement in the game can potentially create immersion in the game. According to Sweetser and Wyeth, experiences can be immersive if they let us concentrate on the task of the game without effort. 
“Effortlessly” can, in this context, be interpreted in several ways: one way to think about it is in terms of how true to real life a gaming experience is and how transparent the interaction with the game that creates the experience is. How the GameFlow model can
be used in sound design for games is covered in more detail below.
A number of research projects report on sound and audio’s ability to create rich, strong and immersive experiences using mobile platforms that give physical freedom to the users. These projects also support the general idea that sound and audio are well suited for use in the design of computer game experiences based on the GameFlow model. Reid, Geelhoed, Hull, Cater, and Clayton report on a public, location-based audio drama called Riot 1831. The evaluation of the project showed that a majority of the users had rich and immersive experiences created from the sounds of an audio-based narrative. Based on the results from this project, the authors argue that “immersion is a positive determinant for enjoyment (and vice versa)” (Reid, Geelhoed, Hull, Cater, & Clayton, 2005). It should be noted that the drama took place in a square in Bristol, UK, which gives this project similarities to pervasive and location-based games where the virtual gameworld and the physical world of the player are blended. Friberg and Gärdenfors (2004) report on a project in which three audio-based games (what the authors term TiM games) were developed. Based on audio communication with the users, the authors report that these games give the users spatial freedom, encourage physical activity and open up possibilities to create new types of interfaces for input to the game. Ekman et al. (2005) report on the development of a game for a mobile platform. They point out that sound and audio can indeed be used to create immersion, but also that the use of sound does not automatically create immersion. Great care must be taken when designing the game sounds and the developers must also carefully select the best technology and equipment to play back the game audio to get the desired effects.
In two projects, the Interactive Institute’s Sonic Studio has investigated how sound in games can be used to bring the user’s own fantasy into play to create new gaming experiences (Liljedahl, Lindberg, & Berg, 2005; Liljedahl, Papworth, &
Lindberg, 2007). In both these projects the balance point between visible and audible stimuli from the game has been moved away from the visual and towards the audible. In both cases the users of the computer games are given only a minimum of visual information and are, instead, given rich and varied soundscapes. The projects have shown that the users have had rich and immersive gaming experiences and are given other types and amounts of freedom compared to more traditional computer games. These projects will be described in more detail later in this chapter. Humankind has, in recent centuries, invested considerable energy and creativity in creating complex technology. We have a long tradition in replacing human capability with machinery. In the early days it was mostly muscle power that was mimicked, replaced, and superseded by steam, combustion, and, later, electricity. It can be argued that research into artificial intelligence is striving to do the same with human cognitive and emotional capabilities. Following this long tradition, it seems that we often neglect human capabilities, affordances, gifts, and needs when designing computer games and other systems. Much of the focus has been on creating photorealistic 3D-environments in real time and less on how the players’ internal, fantasy-driven, “sound interpreter and mapper” can be put into play to create complementary, mental images. In the following I will describe how we at Interactive Institute, Sonic Studio work with finding ways to increase user satisfaction and involvement in gaming situations by using existing technology in slightly new ways. Often, this has meant moving complexity from technology to the user, decreasing the demands on technology used, and increasing the demands on the user to invest and spend energy physically and mentally in a game experience.
MIND THE GAP: SOUND FOR FEEDBACK AND IMMERSION
Pictures are not the real world; they are merely the shadows of it. René Magritte’s provoking pipe is a painting about exactly this: the picture of a pipe and beneath it the text “Ceci n’est pas une pipe” (This is not a pipe). We are surrounded by still and moving images and we are used to treating pictures as pictures and not the real, physical world. Even the most violent computer games and Hollywood film productions are assumed to be physically and mentally non-hazardous to us just because we are supposed to be able to discriminate between reality and the fictive picture of it.
Sound, on the other hand, seems to work slightly differently. When striving for engagement, immersion, and suspension of disbelief in computer games and films, sound plays a very prominent role and, according to Parker and Heerema (2007), “sound is a key aspect of a modern video game”. Natural sounds in the physical world are the result of events in that world and we become aware of physical events to a large degree through sound. It can thus be argued that sound is a strong link to the physical world. In fact, Gilkey and Weisenberger argue that “…an inadequate, incomplete or nonexistent representation of the auditory background in a VE [Virtual Environment] may compromise the sense of presence experienced by users” (quoted in Larsson, Västfjäll, & Kleiner, 2002).
It is this mechanism that is utilized when creating the sound tracks to films and games. Just seeing Donald Duck smash into a wall is not enough. It is not until the sound effect is added that the nature and the full consequence of the smash are made evident to the audience. When we hear the sound of the smash, all of us have our own, slightly unique, experiences of and relationship to the sound. The sound has the power to immediately trigger our interpretation machinery and evoke memories and fantasies.
In a fraction of a second the sound makes us re-live our own experiences and we can feel what Donald feels:
pain, anger, and humiliation. In this way it can be argued that the sound is playing us. Like a guitarist plucking a string that generates sound, sound is plucking our interpretation, spawning memories, understanding, and emotions. The string cannot stop the guitarist from plucking it and we cannot stop sound from triggering our understanding, our memories, associations, and emotions. For a computer game to be successful it is crucial that the players can immerse themselves in the gaming experience and that they are invited to a gameworld and game experience in which they are willing to suspend their natural disbelief. After all, World of Warcraft is not the real world. In their GameFlow concept, Sweetser and Wyeth (2005) set up a number of criteria that game designers and game researchers can use when designing and evaluating games with respect to immersion and suspension of disbelief. Some of these criteria are general, overarching principles that relate to many human activities, while other criteria relate more closely to gaming and the media used to convey the game’s metaphor and narrative. The GameFlow model lists the following criteria for player enjoyment in games:
• Concentration. Games should require concentration and the player should be able to concentrate on the game
• Challenge. Games should be sufficiently challenging and match the player’s skill level
• Player Skills. Games must support player skill development and mastery
• Control. Players should feel a sense of control over their actions in the game
• Clear Goals. Games should provide the player with clear goals at appropriate times
• Feedback. Players must receive appropriate feedback at appropriate times
• Immersion. Players should experience deep but effortless involvement in the game
• Social Interaction. Games should support and create opportunities for social interaction (Sweetser & Wyeth, 2005).
As can be seen from the list, these criteria are very general and could be applied to many aspects of life, from children’s play to high school education, working life, and leisure. When it comes to sound design for computer games, some of these criteria are more relevant than others. When looking at Tuuri, Mustonen, and Pirhonen’s (2007) hierarchical listening modes, a clear link to the GameFlow concept’s Feedback and Immersion criteria can be found. Sweetser and Wyeth divide the Feedback criterion into the following parts:
• Players should receive feedback on their progress towards their goals
• Players should receive immediate feedback on their actions
• Players should always know their status or score (Sweetser & Wyeth, 2005).
The Immersion criterion is similarly divided into the following parts:
• Players should be less aware of their surroundings
• Players should be less self-aware and less worried about everyday life or self
• Players should experience an altered sense of time
• Players should feel emotionally involved in the game
• Players should feel viscerally involved in the game (Sweetser & Wyeth, 2005).
Given our ability to listen on several cognitive abstraction levels, as indicated by Tuuri, Mustonen, and Pirhonen’s hierarchical listening modes, it can be argued that sound is well suited to communicate feedback to the user and to substantially add to the game’s ability to immerse the player in the gaming experience. In the following
we will look at how sound can be used and what sound properties could be brought into play in order to give immediate and continuous feedback to users, to help them become less aware of their surroundings and themselves, and to help them get involved in the game.
SOUND PROPERTIES AT YOUR DISPOSAL
There are a number of properties of sound as a physical, acoustic phenomenon that, in conjunction with the inherent workings of our auditory perception and our ability to use different listening modes, are at our disposal to use, explore, and exploit when designing computer game experiences. Most of these properties are well known in everyday contexts and most people will immediately be able to connect to the descriptions of them, have their own experiences of them, and understand their implications. These properties can, of course, be described in physical and acoustic terms of frequency, amplitude, overtone spectrum, envelopes and so forth. Unfortunately these terms say very little about our human experiences of sounds, sound sources, and soundscapes. It is therefore important to also describe sound properties in relation to how our hearing works. The following is a summary of what we have discussed above and an attempt to start making the discussion more concrete and applicable to sound design for computer games.
Omni-Directionality
Sound is omni-directional and reaches our ears from all directions (almost) simultaneously. Actively and consciously, as well as automatically and pre-consciously, we use this omni-directionality to navigate in our everyday lives. Even though we do not have to look out for saber-toothed tigers anymore, we are constantly warned of cars and buses from left and right, falling trees from
behind, and other dangers. Our ears are under a constant bombardment of auditory input from all directions and we cannot simply turn away from a sound. To be able to handle all this information and to avoid fatigue and sensory overload, we handle most of the input subconsciously. Luckily, we also have the ability to focus on specific parts in the soundscape. We can, for example, isolate a conversation with a friend in a noisy restaurant from a dozen nearby, unrelated conversations. This is often referred to as “the cocktail party problem” (Bregman, 1990, p. 529). In GameFlow terms, the omni-directional qualities of sound relate to both feedback and immersion. Sound for feedback from a game does not force the user to look at a special location on a screen: in fact, it does not require a screen at all. Sound is a strong carrier of emotions, events, and objects, as discussed above. In our everyday lives, we are also used to being surrounded by sound. Mimicking this in a computer game scenario can make profound contributions to the immersive qualities of the game.
Uninterruptible
Along the same line is the fact that we do not have “earlids” and cannot just shut our ears to get rid of the sounds around us or choose to hear just one of the sounds of the total mass that reaches our ears. From an evolutionary point of view, it has been an advantage to get early warnings and hear all dangers, not only the dangers you choose to listen to. It also means that our eyes and our ears are designed differently and that the streams of sensory input from those senses complement and interact with each other. Again, this means that a constant stream of input data must be handled. The way to cope with this is to do it subconsciously. In our everyday lives we are submerged in the ever-present stream of sounds from the world surrounding us. By supplying a relevant and well-designed stream of sounds from a computer game, the users can get constant and natural feedback
on their actions, very much like in real life. This in turn adds in a natural way to the sought-after effortless immersion.
Sound Connects to the Physical World
Sound connects you to the physical world by telling about physical objects and events that involve physical objects. We can be described as hardwired to perceive and automatically interpret sounds as results of events occurring in the physical world. This is true even if the sound is mediated through a loudspeaker: our internal interpreter does not distinguish much between the sound of a physical coffee cup being placed on a table and the recorded sound of the same event played back through a pair of headphones, as long as the technical quality is sufficient. It is still a coffee cup being placed on a table. As with the real-world example you were asked to listen to above, try listening to a film with your eyes shut. It is virtually impossible to turn off the flow of images, feelings and associations flowing through you as you listen. You have to concentrate very hard on something else not to be affected by the sounds that reach your ears. The sound of a dentist’s drill gives a direct bodily sensation and you can almost feel the drill in your own mouth. The picture of the drill alone, without the sound, does not have the same power over our imagination, emotions, and physiology. Again, sound can be used to immerse the user in the gameworld in a way that strongly resembles the way we handle and work in everyday life.
Sound Can Be Ambiguous
We constantly hear sounds from all directions and, to some degree, we can determine the direction of and the distance to the sound source. At will, we can consciously filter out discrete sounds of special interest to us from the whole soundscape around
us, but we are also forced to process most of what we hear subconsciously. Often, we do not know exactly what the source of a sound is or from what direction and distance it comes. We can hear a vehicle approaching from behind but have to guess what type of vehicle it is and how fast it is approaching. We can roughly tell if it is a truck or a car and make educated guesses about when it will pass us, but usually not more than that. Sound leaves a relatively large space within which we can (or are forced to) fill out the details ourselves and make assumptions and interpretations based on our individual memories, experiences and associations. When telling stories, making films or designing computer games, this ambiguity can be of great value. By planting a well-designed sound at the right moment, you can trigger a person’s imaginative and emotive mechanisms by forcing her to consciously or subconsciously interpret and disambiguate the sound. Leaving the user space open to her own interpretation, inviting her and giving her the freedom to use her own imagination can potentially help the user to be emotionally and viscerally involved in the game.
Sound Reaches Us on Subconscious Channels
Our ears are constantly capturing the soundscape around us. If all that data were to be processed by the cognitive and conscious layers in our brains, we would either suffer from mental overload or need a differently constituted brain. But thanks to the limited bandwidth of our consciousness, our subconscious, emotional and intuitive layers process most of the sounds we hear. This does not mean that we are not affected by what our ears pick up and what our brains are processing. What it does mean is that the effect is not totally controllable by us and that we are, to a large degree, victims of the sonic world. Often this is useful, sometimes it is stressful and sometimes it is fun. We are more or less forced
to interpret and react to what we hear. A sound heard spawns meaning and interpretations based on our previous experiences. In games this can be extremely useful as a way to invite the players to invest and get deeply involved in the game. This relates strongly to the GameFlow criterion “immersion” described above.
SOUND TYPES AT YOUR DISPOSAL
There are a number of ways to categorize and classify sounds. In this context it makes sense to use the three categories traditionally used for sound in films and computer games (Sonnenschein, 2001; see also Hug, 2011; Jørgensen, 2011 for more involved taxonomies of computer game sound):
• Speech and dialog. Human language brought to sound, the sounding counterpart to the visual text. The most cognitive and unambiguous of the three types, often used to convey clear messages with the least possible risk of misunderstanding
• Sound effects and the subcategory ambient sounds. The result of events in the physical world. A falling stone hitting the ground; air fluttering in the feathers of a bird; a mechanical clock ticking; a heavy piece of frozen wood dragged over a horizontal, dry concrete floor; the ever-present, ever-changing sounds of the atmosphere
• Music. Sometimes referred to as “the language of emotion”. An integral part of human cultures since the dawn of Homo sapiens.
Note that these categories are only for clarity and discussion. It is important to point out that, in reality, the borders between them are fluid. The borders between music in particular and the other two categories have been blurred for centuries: for example, music and dialog in opera
and musicals, music and ambience, and music and sound effects in games and films.
Speech and Dialog
When you want to convey a clear and unambiguous message, the human voice is a natural choice. The same is true if you want to tell a riddle or recite a poem or just want to be vague and ambiguous. Human language is rich and there are myriad ways to use it in computer game contexts. Speech and dialog can be used to address several of the criteria for player enjoyment included in the GameFlow concept. They can be used to promote concentration on the game by, for instance, providing a complementary source of stimuli, getting the player’s attention without disrupting the player’s visual focus, or spreading the total game workload across complementary channels. Sometimes it is necessary to give instructions to the player on what to do next, or what is expected from the game. If you do not want to exclude the player from an ongoing game sequence or if you have problems with limited screen size, using speech as a complement to text is one solution. Today, more and more computing and gaming platforms have built-in support for voice recognition, which means that the player can control the game by issuing voice commands. Since this is totally in line with what we do in our everyday lives, it also supports a very natural way to co-create the game world and to get a desired sense of impact upon it. Speech is a natural way to get feedback from a game on player progress and distance to game goals without forcing the player to shift visual focus to get the necessary feedback. Speech and human voices are totally natural parts of human society and of everyday lives. The human voice is therefore very well suited to making the players forget that they are participating in the game through a medium and it helps to make the game interface less visible and less obtrusive to the player. Voices can therefore be integrated into
the background soundscape of the game to give a sense of human presence. Apart from the above-mentioned rather objective and technical uses of speech and dialog, all variations of subjective, expressive and dramatic qualities of the human voice are also available. A bad result uttered with an offensive voice will be something radically different from the same result uttered with a friendly and supportive voice. Here, the thin border between computer games, film, theater and other narrative media is clear.
Sound Effects Make It Real
Events in the physical world generate sounds. It is actually very hard to live and be active in this world without giving rise to sounds. Sounds heard in the physical world are the results of events involving physical objects. Explosions in a combustion engine, oscillations of the vocal cords in your throat, putting down your cappuccino cup on the saucer. Sounds are the proofs that you are still firmly attached to the physical world of your senses. The absence of sound, on the other hand, could be the sign that what you are experiencing is not real, that it is a dream or virtual reality. A green rectangle silently moving over a computer screen is probably perceived as just a green rectangle on the screen. But if you add the sound of a heavy stone dragged over asphalt to this simple animation, the green rectangle automagically turns into a heavy stone. Sound and computer game audio is a bridge on which the virtual visual worlds can travel out and become part of the real, physical world. Ambient or background sounds are the sounding counterparts to the graphic background. Having no ambient sounds is like having a pitch-black visual background and can be perceived as an almost physical pressure on the ears. Adding just a virtually inaudible ambient sound to the virtual world of a computer game can create an immediate experience of presence and reality. The silent virtual world that was locked in can be perceived as freed and part of the physical world through the added sound.
Friberg and Gärdenfors use a number of categories for the sounds in the TiM games mentioned above. Most of their categories can be seen as subcategories to the traditional sound effect category. The categories listed by Friberg and Gärdenfors (2004) are:
• Avatar sounds refer to the effects of avatar activity, such as footstep sounds, shooting or bumping into objects
• Object sounds indicate the presence of objects. They can be brief, recurring sounds or long, continuous sounds, depending on the chosen object presentation
• Character sounds are sounds generated by non-player characters
• Ornamental sounds are sounds that are not necessary for conveying gameplay information, such as ambient music, although they enrich the atmosphere and add to the complexity of the game.
In GameFlow terms this means that sound effects and ambient, background sounds can add to several of the criteria for player enjoyment. Presenting a lot of stimuli to the player on various channels is crucial for the ability of the player to concentrate on the game. We are also used to constantly interpreting the soundscape surrounding us, and a well designed game soundscape will have great potential to grab the player’s attention and help them focus on the game. Sound effects are today absolutely necessary for feedback to the players of computer games. Everything from game control commands issued by the player to virtual events caused by non-player characters can be signaled and embodied using sounds. Sound effects and ambient sounds are very important for player immersion and to involve the player emotionally and viscerally in the game. Many of the sound stimuli that reach our ears are processed subconsciously and handling sound
on this level of perception is totally natural to us. This fact also supports the idea that sound is very well suited to adding to the total experience of immersion in the game world.
Music Makes You Feel
Sound in general and music in particular have a very strong ability to touch our feelings. Music works emotionally in two significant ways. Firstly, it tells us stories about feelings that we do not necessarily feel ourselves: the music works like sounding pictures of emotions (Gabrielsson & Lindström, 2001, p. 230). Secondly, music can have the power to induce feelings in us, that is, to actually make us feel (Juslin & Västfjäll, 2008, p. 562). Today, the borders between music, sound effects, ambient background sounds and voices become more and more blurred: music is used as sound effects and sound effects can be used as music. It can therefore be hypothesized that the emotional qualities of music are also, to some extent, true for other types of sounds. Research has shown that music alone, in the absence of supporting pictures or other sensory input, can in many cases and for a majority of people induce feelings of happiness and sadness. Most people can also accurately tell if a piece of music is composed and intended to express sadness or happiness. Other, more complex emotions like jealousy or homesickness are harder to distinguish: music alone has less power to induce such feelings and to actually make us feel them. However, if you add pictures and other media to the musical expression, the musical power increases substantially. Auditory perception tends to dominate judgment in the temporal dimension (Avanzini, 2008, p. 390). Music is a special case of this, since it is sound that is highly structured in time. By synchronizing sound and visual movements, very strong effects can be created.
Some of the music we hear affects us very individually: it is not universal and does not communicate the same thing to two persons. But if the music is paired with something else, for example, a film or a game, something happens. People who say they hate classical music, and would never put on a recording of it, can spend hours watching films with music tracks firmly grounded in the western classical music tradition, sounding like something composed in the late 19th century by Richard Wagner or Gustav Mahler. When musical sounds meet other sensory inputs, for example, music in an animated film, the individual stimuli tend to blend together and become a new whole. The “film + music” object is perceived as being radically different from the film alone and the music alone. The music becomes more universal and has the ability to communicate relatively universal values, emotions, and moods.
Music is normally a very linear phenomenon: a song starts at A and ends at B, and the journey between the two is always the same and takes the same amount of time to travel each time. This is especially true of recorded, mediated music. In a non-linear and interactive context, this linear music concept does not necessarily apply. Most often, music has a form that creates successions of tension and relief, which in turn creates expectations of how the music will continue: the music can therefore not be altered as quickly and easily as other media. To function and be perceived as music, it has to follow at least some basic musical rules of form and continuity. A number of techniques and systems have been developed to cope with the gap between linear music and non-linear environments. Many of these are proprietary systems developed by the commercial game developers and are not available to the general public. What most of the systems seem to agree on is a division between a vertical and a horizontal dimension. The vertical dimension controls aspects of musical intensity and emotion and the horizontal dimension controls aspects of time and form.
The vertical dimension is often implemented using a layered approach whereby a number of musical tracks play in parallel. Each
track plays music with a certain content representing a level of intensity or emotion and the game engine cross-fades between the tracks to create the correct blend of intensity and emotion. The horizontal dimension is often implemented using short phrases of 1, 2 or 4 bars linked together. When a transition from one musical segment to another is motivated by the state of the game, the current phrase is played to the end and the chain of linked phrases takes another route than if the game state had not changed.
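The two dimensions described above can be illustrated with a small sketch. It is not taken from any of the proprietary systems mentioned: the function names, the phrase names, and the choice of an equal-power cross-fade are all assumptions made for illustration. The first function computes per-track gains for the vertical dimension, and the phrase graph shows how the horizontal dimension might route from one short phrase to the next once the current phrase has played to its end.

```python
import math


def layer_gains(intensity, num_layers):
    """Vertical dimension: gains for parallel tracks ordered from calm to intense.

    intensity is clamped to [0, 1]. Only the two tracks adjacent to the
    requested intensity are audible, blended with an equal-power cross-fade
    (an assumed choice; a linear fade would also work).
    """
    if num_layers < 2:
        return [1.0] * num_layers
    intensity = min(max(intensity, 0.0), 1.0)
    position = intensity * (num_layers - 1)
    lower = min(int(position), num_layers - 2)
    frac = position - lower
    gains = [0.0] * num_layers
    gains[lower] = math.cos(frac * math.pi / 2)
    gains[lower + 1] = math.sin(frac * math.pi / 2)
    return gains


# Horizontal dimension: every phrase lists its successor for each game state.
# Phrase and state names are hypothetical.
PHRASE_GRAPH = {
    "explore_a": {"calm": "explore_b", "combat": "battle_intro"},
    "explore_b": {"calm": "explore_a", "combat": "battle_intro"},
    "battle_intro": {"calm": "explore_a", "combat": "battle_loop"},
    "battle_loop": {"calm": "explore_a", "combat": "battle_loop"},
}


def next_phrase(current, game_state):
    """Choose the phrase that starts when the current phrase has finished."""
    return PHRASE_GRAPH[current][game_state]


def play_sequence(start, states):
    """Return the phrase order, given the game state at each phrase boundary."""
    order = [start]
    for state in states:
        order.append(next_phrase(order[-1], state))
    return order
```

Because a transition is only taken at a phrase boundary, the music keeps its basic form and continuity even when the game state changes in the middle of a phrase.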
3D-Positioned Audio
Since sounds are the results of physical events in three-dimensional space, it is often vital to be able to give the impression of game sound as emanating from a certain point in a 3D space (see Murphy & Neff, 2011). 3D-positioned audio is a powerful technique to bridge the gap between the virtual game reality and the physical world of the player’s senses. This is especially true for sound effects but is also very useful for speech and dialog. Music and ambient sounds are most often not 3D-positioned.
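As a rough illustration of what positioning a sound involves, the sketch below reduces the problem to a horizontal plane and two loudspeaker or headphone channels. It is a deliberately simplified model, not the binaural rendering used in commercial engines: the function name, the inverse-distance attenuation, and the equal-power panning law are assumptions chosen for brevity.

```python
import math


def stereo_gains(listener_pos, listener_heading, source_pos, ref_distance=1.0):
    """Return (left, right) gains for a sound source in the horizontal plane.

    Positions are (x, y) tuples. listener_heading is in radians, where 0 means
    facing +y and positive values turn clockwise, towards +x.
    """
    dx = source_pos[0] - listener_pos[0]
    dy = source_pos[1] - listener_pos[1]
    distance = math.hypot(dx, dy)
    # Inverse-distance attenuation, clamped so nearby sources never exceed 1.
    attenuation = ref_distance / max(distance, ref_distance)
    # Angle of the source relative to the facing direction (positive = right).
    azimuth = math.atan2(dx, dy) - listener_heading
    pan = math.sin(azimuth)  # -1 is hard left, +1 is hard right
    angle = (pan + 1.0) * math.pi / 4  # map the pan to 0..pi/2
    return attenuation * math.cos(angle), attenuation * math.sin(angle)
```

In this model a source straight ahead and a source straight behind yield the same gains. A listener can resolve that ambiguity by turning, since the gains change as the heading changes.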
SOUND FOR FANTASY AND FREEDOM
We cannot hear away from a sound like we can look away from an object, and we have no “earlids” to shut as we can our eyelids. These simple facts make sound ideal to use if you are looking for new game concepts to contrast with the traditional screen and eye-based computer games. Western societies are often said to be vision-based or eye-centric. This suggests that we rely mostly on our eyes and use our other senses and abilities more or less just as support for what we see. In language this is reflected in that we “watch” things. We “watch” TV and films despite silent movies being history since the 1930s. We even “watch” music concerts (at least this is true in Swedish). Our knowledge
and awareness about vision, graphic design and so forth is also remarkably higher, more general and more common than their sounding counterparts, as are the creative tools available. In the Association for Computing Machinery’s Computing Classification System (2010), sound and audio are added late compared to, for example, computer graphics. Sound and audio are also mentioned on a lower level (level three) in the classification system, whereas computer graphics is a level two item.
Balance the Senses
Our eyes play a dominant role in our everyday lives and computer game development has traditionally put most emphasis on graphics and vision. At the same time, other modalities and media types such as sound and hearing can be described as underused. This suggests that new computer game concepts could potentially be found by changing the balance between modalities and media types. What happens, for example, if we reduce graphics and visual stimuli and instead build the gaming experience more on sound and audition? What would the effect be if you had a computer game with only an absolute minimum of graphics and instead a rich, varied and gameplay-driving soundscape? Potentially such a game would be immersive in other ways and give different types of game experiences compared to more traditional, graphics-based games.
A couple of things immediately become obvious. First of all, the game designer must let qualities other than computer graphics build and drive gameplay. Secondly, the player is liberated from the need to keep her eyes on a 20-something-inch rectangle (in mobile applications only a few inches). Instead, all of a sudden, she becomes free to move over much larger areas or even volumes. Both of these open up possibilities to create radically new types of computer games for radically new computer game experiences. They also represent new challenges for both game designers and computer game players (see Hug, 2011 for an expansion of such ideas).
Our auditory perception is good at interpreting sounds as tokens of events. When we hear a sound we know something has happened: matter has interacted with matter. The sounds of broken glass, of cars colliding, of footsteps, our own breathing, and combustion engines all contain information about materials, weights, speeds, surface roughness and so on. In our everyday lives we are constantly immersed in a soundscape that we receive through two streams, one in the left ear and one in the right. From day one we start training our perception in order to be able to make priorities and pick out the relevant information from these two streams. Since sound reaches us from all directions, it can be hypothesized that most of the events we hear, we do not see.
In the light of the above, it becomes natural to use sounds as a means to convey feedback on both player actions and other events occurring in the virtual world of a computer game. Since sound tells us in a totally natural way about things we do not see, sound can be used to expand the game world far beyond what is displayed on a screen. Sound is very well suited for delivering the feedback and creating the immersion necessary for successful game concepts, as described by the GameFlow concept above.
The use of sound to convey information about events, creatures and things that are not visible adds yet another dimension to the game experience: imagination, a word originally meaning “picture to oneself”. When we hear a sound without seeing the sound source we make an interpretation of what we have heard. The interpretation is based on previous experiences of, memories of, and associations with sounds with similar properties. The interpretation is often subconscious and made without effort. Inviting the players to use their imagination, fantasy, and associations to fill in the gaps in this way and complement what they see on the screen is one way to make the players emotionally and viscerally immersed in the game.
In a series of research and development projects we have conducted investigations and experiments
based on questions related to the ideas outlined above. These projects have shown that by shifting the balance between graphics and other media types and between eyes and other modalities, games with new qualities can indeed be created: games that attract new user groups and games that can be used in new contexts, in new ways and for new purposes.
In this context it is also relevant to make a distinction between gameplay, or game mechanics, and metaphor. Gameplay can be defined as the set of rules and the mechanics that drive the game, the game’s fundamental natural laws. Metaphor, on the other hand, defines the world in which these abstract laws work. Gameplay can, for example, define that you are able to navigate in four directions called north, south, east and west, that you will be presented with challenges you can either win or lose, and that you win the game by winning a defined number of these challenges. Metaphor defines the world in which the navigation takes place and the nature of the challenges. When gameplay defines an abstract challenge, metaphor can, for example, show an enemy soldier that must be eliminated or it can present the player with a falling egg that must be caught before it hits the floor. A good game must have both well-designed gameplay and a metaphor that supports that gameplay: both are equally important. Often, the sound designer works with the metaphor side of a game. The metaphor chosen dictates the possibilities available to the sound designer. A metaphor with a large number of natural sounds that the players are likely to be able to relate to is potentially more immersive than a metaphor with few and/or unknown sounds.
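One way to picture the distinction between gameplay and metaphor is to keep them in separate data structures, as in the sketch below. This is an illustrative assumption, not a description of how any particular game is built: the class, the event names, the metaphor names, and the sound file names are all hypothetical. The rules only count abstract wins and losses, while the metaphor table decides what each abstract event sounds like.

```python
from dataclasses import dataclass


@dataclass
class Rules:
    """Gameplay: abstract laws that know nothing about soldiers or eggs."""
    wins_needed: int = 3
    wins: int = 0
    losses: int = 0

    def resolve(self, player_won):
        """Record one challenge outcome and return an abstract event name."""
        if player_won:
            self.wins += 1
            if self.wins >= self.wins_needed:
                return "game_won"
            return "challenge_won"
        self.losses += 1
        return "challenge_lost"


# Metaphor: the same abstract events dressed in two different worlds.
# All file names are placeholders.
METAPHORS = {
    "cave_adventure": {
        "challenge_won": "sword_hit.wav",
        "challenge_lost": "scream_and_rocks.wav",
        "game_won": "brass_fanfare.wav",
    },
    "falling_eggs": {
        "challenge_won": "egg_caught.wav",
        "challenge_lost": "egg_smash.wav",
        "game_won": "applause.wav",
    },
}


def sound_for(metaphor, event):
    """Look up the sound that a given metaphor uses for an abstract event."""
    return METAPHORS[metaphor][event]
```

Swapping the metaphor changes every sound the player hears without touching the rules, which is why the choice of metaphor sets the design space available to the sound designer.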
Two Case Studies
In two projects, alternative ways to balance visual and audible stimuli in computer games have been explored by the Interactive Institute, Sonic Studio. In the first project, called Beowulf, a game for devices with limited screen size, such as cell phones, was developed (Liljedahl, Papworth, & Lindberg, 2007). In this project, the hypothesis was set up that a game with most of the graphics removed, having, instead, a rich, varied and challenging soundscape, can create a new type of immersive game experience. The hypothesis also included the idea that a game built mostly on audio stimuli will be more ambiguous and open for interpretation than a game built on visuals and that the need for the users to interpret and disambiguate the soundscape will create a rich and immersive game experience with new qualities compared to traditional computer games. The game uses both a well-known gameplay and a traditional metaphor to keep as many parameters as possible constant. Although the gameplay is very simple, the game’s sound-based metaphor makes it both a challenging and a rewarding game to play. The Beowulf game world is graphically represented by a revealing map, a map showing only the parts of the game world you have visited so far as a red track (see Figure 1). Your position in the game world is indicated by a blue triangle pointing in your current direction.
Figure 1. The Beowulf game window
The player uses headphones to listen to the gameworld, which is described in much greater detail audially than visually. The player navigates this gameworld by listening to sound sources positioned in a 3D space. Navigating includes localizing sound sources by turning and moving to experience changes
and differences just as in real life. Feedback on player actions and progress is given by footstep sounds, breathing sounds, the sound of a swinging sword, and other sounds natural in the context of the game’s world metaphor. Immersion is created through the natural and effortless interaction with the sounding dimension of the gameworld.
In the second project, called DigiWall, the computer monitor was removed totally (Liljedahl, Lindberg, & Berg, 2005). Instead, a computer game interface in the form of a climbing wall was developed (see Figure 2). The 144 climbing grips are equipped with sensors reacting to the touch of hands, feet, knees, and other body parts. The grips are also equipped with red LEDs and can be lit, turning the wall’s climbing area into an irregular and very low-resolution visual display. A number of games were then developed based on a balanced mix of sounds, physical activity, and the sparse visuals of the climbing grips. The absence of traditional computer game graphics and the shift in balance between modalities and media types give another effect: the games become open for the players to adapt to their own level of physical ability, their familiarity with the games, how they choose to team up, to create variation and so on. In this sense, the new balance between modalities and media types means new freedom for the players.
Figure 2. DigiWall climbing wall computer game interface
Both projects explored questions related to how computer game players could be offered new and unique gaming experiences in terms of freedom and fantasy. In Beowulf the hypothesis was that a shift in balance between eye and ear would invite the players to co-create the game experience and to bring their imagination into play in new ways, compared to traditional, graphics-based games. The studies performed on the game concept showed that, to a majority of players, this was also the case. The DigiWall concept is based on the players’ freedom to use their whole bodies and to play the games by moving over the whole 15 m² game interface. The absence of a traditional computer monitor also opens up the rules of play in such a way that the users are invited to co-create and adapt the basic gameplays offered to their own needs and desires.
In this context it is also important to mention the term “user investment”. Both projects eventually showed that the need to interpret and disambiguate the soundscape of the games was in fact an asset. Both games more or less forced the players to use their own imagination and experiences to flesh out the sounding skeletons supplied by the games’ metaphors. In the Beowulf case, the user investment was expressed as high rankings in game satisfaction as well as in vivid descriptions of the
gameworld’s environments, materials, temperature, atmosphere, inhabitants and so forth, none of which had any visible cues. In the DigiWall case, positive user investment ranked highly both in player satisfaction and the subsequent publicity and commercial success of the project.
In these projects, audio is used in a number of ways to create a sense of presence and to link, as closely as possible, the virtual reality of the game to the physical reality of the player. Sound was also used to communicate instructions, cues, clues, feedback, and results from the game to the player. The aim was to create new balances between sound and graphics compared to traditional computer game applications and to explore if and how sound could be used to drive gameplay and to create fun, challenging, rewarding and immersive gaming experiences. The aim was also to use sound to blur the borders between the virtual reality of the game and the physical reality of the player. In both cases, game metaphors were chosen to match the gameplays and to present as many possibilities and as large design spaces as possible for the sound designers. Here follows a brief description of how sound was implemented in the two projects.
Ambience and Background to Bridge the Reality Gap
Physical environments are (almost) never silent. Air, water, objects, creatures and machines around us all more or less make sounds. The absence of sound is unnatural and scary; it is an auditory counterpart to pitch black. Sounds are the signs of presence, life and function. By adding just a very soft sound of moving air, an otherwise dead and detached game environment can come alive. If the sound is well designed, it is possible to create an experience where the game-generated sounds blend with the sounds from the gamer’s physical environment, creating an inseparable whole. The gap between the realities closes.
Ambient sounds can be strong carriers of emotion and mood. They share this ability with music, and the border between the two is more and more often blurred by film and game sound designers (Dane Davis, cited in Sonnenschein, 2001, p. 44). Carefully “composing” an ambient or background sound can serve several purposes at the same time. It can create a sense of physical presence, it can set the basic mood and it can communicate emotion and arousal. In the Beowulf game, the ambient sounds were the sound of air softly flowing through the gameworld’s system of caves and tunnels. The sounds had a slight amount of reverb added to create a sense of volume in the caves and the reverb was removed for tunnels. The ambient sounds were also deliberately freed as much as possible from musical components such as pitch and rhythm: we wanted to give the players as much freedom as possible to use their own imagination, not influencing them in any direction defined by us more than necessary.
Most of the DigiWall games use music tracks as ambient and background sounds. In this case the purpose is the opposite. Music is used to set the basic mood of the games and to encourage physical activity in the players. The music is designed to communicate subconsciously with the
players and, for example, “whisper” that speed is increasing or that time is running out and you must hurry.
Sound Effects and Music for Cues and Clues
Often game designers want to encourage the players to go in certain directions or to take certain actions. By carefully planting sound effects and/or music, the designer can guide, inspire or even intentionally mislead the player. Beowulf uses a large number of natural sounds to warn the player of potential dangers such as predators, bottomless holes or boiling lava. The DigiWall games use music and sound effects with musical properties to guide attention in certain directions on the wall. One example is the game Catch The Grip, in which the direction from the last grip caught to the next to catch is represented by a series of notes. The length of the series tells the physical distance on the wall. The panning of the notes in the loudspeaker system signals the direction left/right. In the game Scrambled Eggs, sound effects with a falling pitch denote the movement of “eggs” falling from the top of the wall towards the floor.
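The Catch The Grip cue can be approximated in a few lines. The sketch below is a guess at the general idea, not the DigiWall implementation: the function name, the coordinate system in meters, the notes-per-meter scaling, and the use of rising or falling pitch for vertical direction are all assumptions. Only two properties come from the description above: the number of notes grows with the distance between grips, and the pan position of the notes signals left or right.

```python
def grip_cue(last_grip, next_grip, wall_width, notes_per_meter=2.0, base_midi=60):
    """Return a list of (midi_note, pan) pairs leading from one grip to the next.

    Grips are (x, y) positions in meters, with x measured from the left edge
    of the wall. pan runs from -1 (left loudspeaker) to +1 (right loudspeaker).
    """
    dx = next_grip[0] - last_grip[0]
    dy = next_grip[1] - last_grip[1]
    distance = (dx * dx + dy * dy) ** 0.5
    # Longer distances produce longer series of notes (at least one note).
    count = max(1, round(distance * notes_per_meter))
    # Assumed convention: pitch rises if the target is higher, falls if lower.
    step = 1 if dy > 0 else -1 if dy < 0 else 0
    cue = []
    for i in range(count):
        t = (i + 1) / count  # fraction of the way to the target grip
        x = last_grip[0] + dx * t
        pan = 2.0 * x / wall_width - 1.0
        cue.append((base_midi + step * i, pan))
    return cue
```

For example, a target two meters to the right on a four-meter-wide wall yields four notes whose pan positions move steadily towards the right loudspeaker.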
Speech, Music and Sound Effects for Information and Feedback
Many sounds are emotional and meant to create and communicate mood and presence. Other sounds are meant to convey cognitive information about rules, scores, results and so forth. Speech is, of course, very versatile and useful in this case. It is very effective to have a voice read the initial instructions for a game, especially if it is a game with relatively simple gameplay and few rules. The same is true for scores and results. Who won, the left or right team? How many points did you score? To have a voice read these results creates a strong feeling of presence and makes the game come alive. One drawback with speech is, of course, language. For example, Swedish voice-overs in a game do not make very much sense in the UK. As with text, it is necessary to have localized versions and this quickly starts adding cost in terms of computer memory, coding, development time, and other resources. But then again, sometimes it is worth it. In the DigiWall games serving as an example here, speech is used as introduction to all games. A majority of the games also present scores and results using speech. The DigiWall game interface is equipped with two buttons, so the players can select one of two available languages. A danger with speech is the risk of wearing out often-repeated phrases. It is therefore useful to give the players the option to skip, for example, instructions when they are no longer needed.
Music and sound effects can also serve as carriers of information, albeit not as clearly and unambiguously as speech. This is not an innate disability though, but rather an effect of the way we use music and sound effects. Rhythm, for example, can be used to convey semantics just as well as any speech: what is required is simply to learn the system (Morse code, for example). One of the advantages with sound effects and music is that they are not limited by language, but are more universal. This can of course be used in many ways. In the Beowulf game, each new round starts with a short horn melody, as if it were announcing the approach of the king’s ambassador. The players learn very quickly what this signal means and, since it is very short, the risk of becoming bored with it is minimal. Beowulf also uses pure music to signal success and failure. Success is signaled by a short triumphant brass fanfare and failure is signaled by a short funeral march. By carefully selecting the metaphor aspect of a game’s design, tremendous opportunities to create sound effects for feedback and information can be opened up.
By placing the game in an environment (metaphor) that the players are likely to have some kind of relation to, the designer can choose sounds for feedback and information that are natural in that environment. Using natural
sounds that the players can immediately relate to can greatly enhance the gameplay aspect of the same game as well as create the sought-after sense of presence and immersion. The DigiWall’s game Scrambled Eggs uses the sound of broken eggs to signal points lost and the sound of an egg rescued in the palm of your hand to signal points gained. In Beowulf, if the player enters a forbidden game tile, the sound of a scream receding down a hole together with the sound of falling rocks signals life lost. When this is followed by the funeral march, failure and the end of the game are obvious to anyone, without the need for speech or text.
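The earlier remark that rhythm can carry semantics once the system has been learned can be made concrete with Morse code, the example the text itself names. The sketch below turns a single word into a list of sound and silence durations using the standard Morse timing (a dot lasts one unit, a dash three units, the gap inside a letter one unit, the gap between letters three units). The function name, the 100 ms default unit, and the restriction to a handful of letters are assumptions made to keep the example short.

```python
# A small subset of the international Morse alphabet.
MORSE = {
    "A": ".-", "E": ".", "I": "..", "N": "-.",
    "O": "---", "S": "...", "T": "-",
}


def rhythm(word, unit_ms=100):
    """Return a list of (is_sound, duration_ms) events for a single word."""
    events = []
    for letter_index, letter in enumerate(word.upper()):
        if letter_index > 0:
            events.append((False, 3 * unit_ms))  # gap between letters
        for symbol_index, symbol in enumerate(MORSE[letter]):
            if symbol_index > 0:
                events.append((False, unit_ms))  # gap inside a letter
            duration = unit_ms if symbol == "." else 3 * unit_ms
            events.append((True, duration))
    return events
```

The resulting event list could drive any percussive sound in the game, so the same rhythm can be dressed in whatever sounds are natural within the chosen metaphor.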
FUTURE RESEARCH DIRECTIONS
It is often said that sound is still underused and that audio is a media type with potential yet to be unleashed. In order to free this unused potential, research and development efforts must be carried out on several parallel fronts. We need to develop more in-depth knowledge about auditory perception and how heard experiences affect users of computer games and other interactive systems. This also implies richer taxonomies and more developed languages for writing about, talking about and reflecting on this new knowledge and making it useful in wider contexts. Furthermore, a number of current ideas and traditions in the field must be challenged and a set of updated ideas must be developed. Treating ambiguity and wider interpretation spaces as design assets rather than problems in the design of interactive systems is one example. Another example is the replacement of simple efficiency metrics for player enjoyment with more complex systems for the design and evaluation of computer game experiences, such as the GameFlow concept. Finally, new technology that can carry and realize the new knowledge and ideas must be developed. This includes technologies for procedural audio (see Farnell, 2011; Mullan, 2011 for further descriptions of this technology) and systems for dynamic
simulation of room acoustics and acoustic occlusion and obstruction, just to name a few.
CONCLUSION
Sound is a complex stimulus and it is only in recent years that science has started to understand auditory perception in any depth. Much of the knowledge and practice in sound design for computer games and other interactive applications is based on experience and anecdotal evidence. But the awareness of sound’s potential and scientifically-based knowledge in sound design is slowly increasing. This is not only true in the computer game industry, but in industry and society in general. The implications of the fact that our ears and our eyes complement each other are slowly beginning to have an effect. Graphics alone gives one type of experience, sound alone gives another type of experience, and graphics plus sound gives new and unique experiences. By working with the balance of ears, eyes and other senses and human abilities, new opportunities emerge for the computer game designer. The Wii, Dance Dance Revolution and DigiWall are just a few examples of this.
Sounds in the physical reality of our bodies are the results of physical events in that same reality. Our hearing is designed and “hardwired” to constantly scan and analyze the soundscape surrounding us and react rationally to the sounds heard. Most of the time this is done subconsciously and our hearing can therefore be described as, to a large degree, intuitive, emotional, or pre-cognitive. The soundscape reaching our ears demands interpretation and disambiguation in other ways than the visual stimuli reaching our eyes. This need to interpret and disambiguate can be turned into a great asset in computer game design. A game with a well-designed, rich, and varied soundscape will play on the user’s intuition and emotions: the game will be immersive and give fun and rewarding gaming experiences.
How we interpret a sound depends on, and draws from, our previous personal experiences. Well-known sounds will spawn a myriad of pictures in our inner, mental movie theaters. Unknown sounds can create both confusion and excitement. Working in parallel with the gameplay and the metaphor aspects of computer game design, and making sure that the two match and support each other, is a powerful way to find and design the sounds that build the total soundscape of the game. By working in parallel with and carefully balancing the graphics and the sounds of a computer game, the users’ bodies and fantasies can be set free, creating unique, immersive, and rewarding gaming experiences.
REFERENCES
Association for Computing Machinery. (2010). ACM computing classification system. New York: ACM. Retrieved February 4, 2010, from http://www.acm.org/about/class/.
Avanzini, F. (2008). Interactive sound. In Polotti, P., & Rocchesso, D. (Eds.), Sound to sense, sense to sound – A state of the art in sound and music computing (pp. 345–396). Berlin: Logos Verlag.
Bregman, A. S. (1990). Auditory scene analysis: The perceptual organization of sound. London: MIT Press.
Cowley, B., Charles, D., Black, M., & Hickey, R. (2008). Toward an understanding of flow in video games. ACM Computers in Entertainment, 6(2).
Csíkszentmihályi, M. (1990). Flow: The psychology of optimal experience. New York: Harper Collins.
Dance dance revolution [Computer game]. (2010). Tokyo: Konami.
DigiWall [Computer game]. (2010). Piteå, Sweden: Digiwall Technology. Retrieved February 10, 2010, from http://www.digiwall.se/.
Ekman, I., Ermi, L., Lahti, J., Nummela, J., Lankoski, P., & Mäyrä, F. (2005). Designing sound for a pervasive mobile game. In Proceedings of the ACM SIGCHI International Conference on Advances in Computer Entertainment Technology, 2005.
Farnell, A. (2011). Behaviour, structure and causality in procedural audio. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Friberg, J., & Gärdenfors, D. (2004). Audio games: New perspectives on game audio. In Proceedings of the ACM SIGCHI International Conference on Advances in Computer Entertainment Technology, 2004, 148-154.
Gabrielsson, A., & Lindström, E. (2001). The influence of musical structure on emotional expression. In Juslin, P., & Sloboda, J. A. (Eds.), Music and emotion: Theory and research. Oxford, UK: Oxford University Press.
Gaver, W. (1993). What in the world do we hear? An ecological approach to auditory event perception. Ecological Psychology, 5(1), 1–29. doi:10.1207/s15326969eco0501_1
Gaver, W. (1997). Auditory interfaces. In Helander, M. G., Landauer, T. K., & Prabhu, P. (Eds.), Handbook of human-computer interaction (2nd ed.). Amsterdam: Elsevier Science. doi:10.1016/B978-044481862-1/50108-4
Gaver, W. W., Beaver, J., & Benford, S. (2003). Ambiguity as a resource for design. Proceedings of the ACM CHI Conference on Human Factors in Computing Systems, 2003, 233-240.
Hug, D. (2011). New wine in new skins: Sketching the future of game sound design. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Jegers, K. (2009). Elaborating eight elements of fun: Supporting design of pervasive player enjoyment. ACM Computers in Entertainment, 7(2).
Jørgensen, K. (2011). Time for new terminology? Diegetic and non-diegetic sounds in computer games revisited. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Juslin, P. N., & Västfjäll, D. (2008). Emotional responses to music: The need to consider underlying mechanisms. The Behavioral and Brain Sciences, 31, 559–621.
Larsson, P., Västfjäll, D., & Kleiner, M. (2002). Better presence and performance in virtual environments by improved binaural sound rendering. In AES 22nd International Conference on Virtual, Synthetic and Entertainment Audio.
Lecanuet, J. P. (1996). Prenatal auditory experience. In Deliège, I., & Sloboda, J. (Eds.), Musical beginnings: Origins and development of musical competence (pp. 3–36). Oxford, UK: Oxford University Press.
Liljedahl, M., Lindberg, S., & Berg, J. (2005). Digiwall: An interactive climbing wall. Proceedings of the ACM SIGCHI International Conference on Advances in Computer Entertainment Technology, 2005, 225-228.
Liljedahl, M., Papworth, N., & Lindberg, S. (2007). Beowulf: An audio mostly game. Proceedings of the International Conference on Advances in Computer Entertainment Technology, 2007, 200–203.
Mullan, E. (2011). Physical modelling for sound synthesis. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Murphy, D., & Neff, F. (2011). Spatial sound for computer games and virtual reality. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
O’Callaghan, C. (2009 Summer). Auditory perception. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy. Retrieved January 24, 2010, from http://plato.stanford.edu/archives/sum2009/entries/perception-auditory/.
Parker, J. R., & Heerema, J. (2008). Audio interaction in computer mediated games. International Journal of Computer Games Technology, 2008, 1–8. doi:10.1155/2008/178923
Polaine, A. (2005). The flow principle in interactivity. In Proceedings of the Second Australasian Conference on Interactive Entertainment.
Reid, J., Geelhoed, E., Hull, R., Cater, K., & Clayton, B. (2005). Parallel worlds: Immersion in location-based experiences. In CHI ’05 Extended Abstracts on Human Factors in Computing Systems.
Sengers, P., Boehner, K., Mateas, M., & Gay, G. (2008). The disenchantment of affect. Personal and Ubiquitous Computing, 12(5), 347–358. doi:10.1007/s00779-007-0161-4
Sengers, P., & Gaver, B. (2006). Staying open to interpretation: Engaging multiple meanings in design and evaluation. Proceedings of the 6th Conference on Designing Interactive Systems, 2006, 99-108.
Sonnenschein, D. (2001). Sound design: The expressive power of music, voice and sound effects in cinema. Studio City, CA: Michael Wiese Productions.
Sweetser, P., & Wyeth, P. (2005). GameFlow: A model for evaluating player enjoyment in games. ACM Computers in Entertainment, 3(3).
Tuuri, K., Mustonen, M. S., & Pirhonen, A. (2007). Same sound – different meanings: A novel scheme for modes of listening. In Proceedings of Audio Mostly 2007 – 2nd Conference on Interaction with Sound, 13-18.
World of warcraft [Computer game]. (2010). Reno, NV: Blizzard Entertainment.
ADDITIONAL READING
Altman, R. (Ed.). (1992). Sound theory sound practice. New York: Routledge.
Boehner, K., DePaula, R., Dourish, P., & Sengers, P. (2005). Affect: From information to interaction. In Proceedings of the 4th Decennial Conference on Critical Computing: Between Sense and Sensibility.
Brown, E., & Cairns, P. (2004). A grounded investigation of game immersion. In CHI ’04 Extended Abstracts on Human Factors in Computing Systems.
Juslin, P., & Sloboda, J. A. (Eds.). (2001). Music and emotion: Theory and research. Oxford, UK: Oxford University Press.
Kaptelinin, V., & Nardi, B. A. (2009). Acting with technology: Activity theory and interaction design. Cambridge, MA: MIT Press.
Norman, D. A. (1988). The design of everyday things. New York: Basic Books.
Polotti, P., & Rocchesso, D. (Eds.). (2008). Sound to sense, sense to sound: A state of the art in sound and music computing. Berlin: Logos Verlag.
Schafer, R. M. (1977). The soundscape: Our sonic environment and the tuning of the world. Rochester, VT: Destiny Books.
Sider, L., Freeman, D., & Sider, J. (Eds.). (2003). Soundscape: The School of Sound lectures 1998–2001. London: Wallflower Press.
KEY TERMS AND DEFINITIONS
Auditory Perception: The process of attaining awareness or understanding of auditory information or stimulus.
Avatar: A controllable representation of a person or creature in a virtual reality environment.
Feedback: Output from a computer game to inform the user of various changes in game state.
Flow: The mental state of operation in which a person is fully immersed in what he or she is doing by a feeling of energized focus, full involvement, and success in the process of the activity.
Gameplay: The rules and mechanics defining the functionality of a computer game.
Game Metaphor: The embodiment of the virtual environment comprising the game world.
Immersion: Deep mental involvement.
Pervasive Game: A computer game tightly interwoven with our everyday lives through the objects, devices and people that surround us and the places we inhabit.
Suspension of Disbelief: A silent agreement between an audience and an entertainment producer in which the audience agrees to provisionally suspend their judgment in exchange for the promise of entertainment.
Chapter 3
Sound is Not a Simulation: Methodologies for Examining the Experience of Soundscapes
Linda O’Keeffe
National University of Ireland, Maynooth, Ireland
ABSTRACT
In order to design a computer game soundscape that allows a game player to feel immersed in their virtual world, we must understand how we navigate and understand the real world soundscape. In this chapter I will explore how sound, particularly in urban spaces, is increasingly categorised as noise, ignoring both the social significance of any soundscape and how we use sound to interpret and negotiate space. I will explore innovative methodologies for identifying an individual’s perception of soundscapes. Designing virtual soundscapes without prior investigation into their cultural and social meaning could prove problematic.
INTRODUCTION
Simmel (as cited in Frisby, 2002) argues that the exploration and navigation of a space, particularly an urban space, impacts all of the human senses. Equally he suggests that when exposed to multiple inputs of both internal and external stimuli, we make choices, such as movement and interaction, based on the sensory information of a given space (Simmel, 1979). In the design of gameworlds, we must examine this concept of sensory input as both a method of navigation and socialisation.
Within the real world all the senses are exposed to information: sight, sound, smell, and touch. Within a gameworld, we are currently exposed to an overriding visual experience and minimal sound information. There is a deficit of sensory information within this digital world and, as more people move towards gaming and virtual communities, this deficit must be examined. For digital virtual worlds to create a convincing immersive experience with the technology that is available, we must explore sound as well as sight in the construction of gameworlds from a sociological perspective.
DOI: 10.4018/978-1-61692-828-5.ch003
Copyright © 2011, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
Thompson (1995) argues that when we enter virtual spaces or communities we leave orality behind: he sees no space for sound within virtual worlds or online communities. It has been prevalent in social and media theory to ignore the experience of sound in a space, whether that sound is produced by human activity or by other natural sources. It is my argument that sound plays a part in the social construction of space, whether real or virtual, either by its presence or absence. Equally, I will argue that sound which is produced by objects through reverberation and other acoustic qualities can affect how we navigate or place meaning in a space. I will also explore the process of control which is dominating research into the soundscape; this is primarily due to an increasing awareness of the side effects or apparent dangers of loud sounds on people. The need to monitor and control sound in the environment has become a predominant research focus within soundscape studies. Sounds within urban centers are increasingly seen as a by-product of industry and technology: this has led to the creation of noise policies within a number of countries. Sound is increasingly treated as a matter of sound pressure levels rather than as a social structure (Blesser & Salter, 2009, pp. 1, 2). This is significant for sound designers who wish to gather data on the meaning of sound within society. If a sound designer considers sound only in relation to volume, noise, or other objective criteria, they might ignore the meaning of sound beyond its output level. In looking at the social and perceptual aspects of sound we are constructing what Feld (2004) would call an acoustemology of the sound world. He increasingly acknowledges that soundscape studies, which react to human interventions in the natural soundscape, ignore the cultural systems which develop as a result of being immersed in and surrounded by sound.
The game space, or any virtual space which asks a person to become immersed in it, needs to be founded upon an understanding of the sociological impact of sound on the individual and society. A game designer must also take into account the more abstract representation of sound that is experienced in art, cinema, and other mediated spaces. There is already a history of the experience of sound through mediatisation (Bull, 2000; Cabrera Paz & Schwartz, 2009; Cohen, 2005; Drobnick, 2004): the difference between these theories and the theory of game sound design lies in the concepts of immersion, interactivity, and simulated reality. What describes a soundscape, who defines the description, and what models are used to categorise levels of sound and their meaning? There are no set methods for the study of acoustic ecology or the soundscape from a sociological perspective. I propose an interdisciplinary method which will draw on social theory, media theory, and sound design. In order to explore the soundscape we must incorporate different methods and theories to analyse the social impact of the soundscape, real or virtual, on the individual and the group.
THE EXPLORATION OF THE SOUNDSCAPE
Some of the earliest documented exploration of the modern soundscape arose from within the arts and modern music composition. Those who practised the art of listening explored the changes in our early soundscape; technology was seen to change our soundscape, but this was not seen as a negative event (Russolo, 1913). Luigi Russolo’s 1913 manifesto, The Art of Noise, posited that sound had reached a limit of invention; technological sounds allowed for an “enjoyment in the combination of the noises of trams, backfiring motors, carriages, and bawling crowds”. He argued that in listening to and using these sounds as types of music we would create an awareness of the rapidly changing soundscape. In an ever-changing technological climate, we would increasingly be exposed to new types of sounds at a faster rate than at any time preceding mechanisation. The
soundscape would also play a much stronger part in the construction of music and sound art with the introduction of audio recording devices. However, over time this modern soundscape became less a usable musical landscape or instrument and more like an environmental pollutant (Bijsterveld, 2008). Bijsterveld (2008) argued that technology became a symbol of the loudness and unhealthy character of the urban soundscape. Schafer’s examination of the soundscape in the 1960s was guided by an awareness of the increased levels of sound within urban centers as well as (Cohen, 2005; Schafer, 1977). He argued that the spread of industrialisation polluted not only such physical spaces as water and land but also the hearing space, leading to an alteration of the perceived space for animals and humans. Sound, or what was now being called noise, was increasingly seen as a negative side effect of industry. Schafer’s research focused on a reification of past soundscapes and the preservation of soundmarks (similar to historic landmarks). The World Soundscape Project, established in the late 1960s by several people, including Murray Schafer, proposed a practice of recording the landscapes of different spaces around the world. They wanted to record and archive certain landscapes they felt were being transformed as a result of a noisier soundscape. These recordings would then highlight the effect that increased sound levels were having on certain spaces. Although Schafer brought the sound world into the equation as a factor within industrial change, very little focus has been placed on the positive aspects of contemporary soundscapes or their social meaning. Human activities produce sound, and we are also embedded in sound. Space becomes revealed to us through sound and, as spaces become more built up or newly transformed, our ability to see beyond our immediate space becomes limited.
Blesser and Salter (2009) argue that sound allows us to envision our space; a space becomes revealed to us through its “aural architecture”. They examine the ability of humans to restructure their sound
environment, to act back on loud sound spaces and argue that because constructed spaces remain static it is through social behaviour that we have the ability to modify our sound arena.
The Designed Space
De Certeau (1988) argues that the city is a representation of political economy, historical narrative, and social forces of capitalism and that, while architects and planners see the whole, the vista, the individual who lives and works in the city will never see it in totality. He suggests that we walk the city blindly, reconstructing our own narratives of space. De Certeau implicates sound, without referencing it, as a way to see an invisible whole. He argues against the rationalising of the city or functionalist utopianisms, allowing for the transformation of space by those that live within it. Adams et al. (2006) suggest that “a soundscape is simultaneously a physical environment and a way of perceiving that environment” (p. 2). They see the soundscape as a construct through which we will navigate. Adams et al. and de Certeau understand that the construction of space and our ability to navigate through it is dependent on more information than the visible. In recreating the soundscape in digital landscapes, the designer pays homage to the real world she tries to replicate: she codes, intentionally or not, the universalisms of design into the construction of her virtual space. The space is built to replicate the reflection of sound against object, as if this is the only way sound moves through space or, equally, the only way we perceive it. She is equally guided by the epistemology of sight, as “the epistemological status of hearing has come a poor second to that of vision” (Bull & Back, 2004, p. 1). Like any other visual medium, the design makes assumptions about how sound should be perceived in any constructed space. This functional approach only measures our potential physiological responses to sound. It does not explore the individual or community experience of sound or
the subjective and immersive experience of time and space through either real world listening or mediated listening. Augoyard and Torgue (2006) theorize that sound may guide social behaviours: they argue that no sound event can be removed from “spatial and temporal conditions” and that sound is never experienced in isolation. They have adopted qualitative approaches to the exploration and analysis of sound in urban spaces. Augoyard and Torgue argue that the term “soundscape” is tied to a certain empirical model of measurement which may be too narrow in its meaning, belonging more to a textual rather than observational critique of “acoustical sources” and “inhabited spaces” (2006, p. 4). They suggest that the term sonic effect better describes the experience of sound within space. It breaks the analysis of sound into three distinct fields: “acoustical sources, inhabited space, and the linked pair of sound perception and sound action” (Augoyard & Torgue, 2006, p. 6). Each of these fields is required in order to examine the ubiquitous nature of the soundscape as a process which impacts on social, physiological, and psychological behaviour. What is most difficult to analyse, but fundamental to soundscape design, is the subjective experience of sound. When constructing a virtual landscape, the primary consideration is (and for a number of games it is the only goal) the reaction time of game player interaction: if I shoot, will I hear the sound of gunfire instantaneously?
MEDIATED LISTENING
The number of people turning to electronic devices (MP3 players, the Walkman, the iPod, mobile games, and laptops) as a means of shutting out real world sounds has increased exponentially in the last decade (Bull, 2000). The personal headphone has played a part in reconfiguring the landscape, allowing us a choice in how we perceive our world and how we are perceived as taking part in or stepping out
of real time and space. Thompson (1995) explores the change in perception of “spatial and temporal characteristics of social life” (p. 12) due to the development of communications technology. He recognises that the role of oral traditions has changed: face to face contact is eliminated in favour of virtual communications. Bull (2000) argues that mediated listening is now used as a means to escape the “urban overload” of our cities and suggests that the use of mobile technology for listening to the radio or to music collections affords a breather or a metaphysical removal from the real world. How we shift between these acoustic environments, and how our personality and behaviour may be manipulated, both by our apparent control of one type of space and our lack of control over another, may affect social patterns of relating to each other and the world we inhabit.
Sound Control
Research has shown that the reasons for putting on headphones are motivated by numerous factors, such as (Bull, 2000). Erving Goffman’s (1959) theory of civil inattention addresses this concept. He examines the unwillingness of the individual to be seen in public spaces and explores the notion of contexts structuring “our perception of the social world” (as cited in Manning, 1992, p. 12). Goffman suggests that social spaces are framed and, within these frames, we act a certain way. How we act is perceived as being the acceptable or normal behaviour for those spaces. He uses the example of the elevator space: when travelling in such a confined space, the “normal” behaviour is to look anywhere but at another person’s face. Mediated spaces contain their own framed context. When we engage in a fully immersive experience, such as gaming or mediated listening, even if this happens in a public space, we are not seen to be ignoring the real world. We are seen to be engaged within another space, one which requires our full attention.
Bull’s (2000) research also highlights how the perception of time becomes distorted when listening to personal headphones. For some, listening is required to manage the boredom of “slow time”. It is also used to negotiate a path through space, a path which is experienced through a virtual soundscape or soundtrack, and this alters the listener’s perception of time. Bull’s studies have revealed that time is almost always a reason for engaging in mediated listening. This concept of controlling space and time through mediated listening suggests that the senses required for listening extend beyond simply hearing. If the experience of listening alters the perception of time and space, then reality also becomes less fixed and more flexible. Lefebvre (2004) argues that time and everyday life exist on multiple levels and that the experience of time contains a value coding, depending on the task being done. He suggests that time is both fundamental and quantifiable and that quantifiable time is an imposed measure which is based on the invention of clocks and watches. When engaged in mediated listening (radio, sound art, audio books, and games, for example), time may be re-appropriated. We are experiencing what Schafer called a schizophonic shift in perception, where, by means of mediated listening, we exist between two time zones, one created by our imagination and the other by the world around us. Devices such as stereo headphones, mobile phones, and portable games, which we use to pull us out of time, also act as filters: they give us the choice to decide what it is we hear and do not hear. Equally, we can choose to hear both spaces, real and mediated, so that we do not become so distracted in our mediated listening that we walk under a car. The increased use of mediated listening devices, particularly in public spaces, might be seen as an adaptation to the increase in sound levels within urban spaces.
It could also be a result of the sheer diversity of sounds that exist within our world, most of which have no meaning or relevance in our day-to-day lives.
There are massive assumptions being posited by researchers into the field of noise or increased sound levels. Schafer and the World Forum for Acoustic Ecology argue that increased sound levels are creating a rift between the natural world and humanity’s relationship to it. They support research which is concerned with the “preservation of natural and traditional soundscapes” (Epstein, 2009). This focus on the conservation of older or traditional soundscapes ignores the “everyday urban situations impregnated with blurred and hazy...sound environments” (Augoyard & Torgue, 2006, p. 6).
NOISE: THE SIDE EFFECT OF INDUSTRY
The term noise is often used to describe unwanted sound or sound that, in its make-up, carries certain characteristics that define it as negative. Schafer’s early work on the soundscape explored ways of quantifying noise levels. One of his early explorations into the soundscape used a system of tables which measured the number of complaints made against certain noise sources; this project was carried out in several countries. Schafer’s research concurred with what most people would suspect: in most modern cities, traffic is seen as a pollutant both for carbon emissions and for sound levels. Yet in Johannesburg, South Africa, we see a very different picture in relation to what is seen as noise and what is accepted as city sounds (Schafer, 1977, p. 187). The vast majority of complaints for sounds considered intrusive or annoying were made against the increased sounds of animals and birds within the city: unusually, the smallest number of complaints was directed towards traffic. It could be argued that one type of sound is seen as normal and part of the everyday urban while the more natural sounds no longer fit with the concept of an urban landscape.
Sound as Side Effect
One of the areas on which the study of noise pollution within the urban soundscape has focused is that of the motor vehicle, which is seen as a major contributor to increased sound levels within cities and towns. Bijsterveld’s (2004) historical analysis of noise laws highlights the increasingly negative public opinion directed towards the motor vehicle since the turn of the century. The city was increasingly seen as a space which had once held silence, and it was felt that this silence needed to be regained, either through the removal of motor vehicles or through severe noise laws. Yet, over the decades, a relationship has developed between motorists and the sounds of their vehicles, an idea which is being explored by Paul Jennings. Jennings’ (2009) research focuses on the positive aspects of sounds produced by cars, from the sound of the door shutting to the sounds of a petrol engine. He explores the various ways of simulating the sounds emitted by cars; studies have revealed that drivers have developed a relationship to the sounds produced by cars, associating them with qualities such as power, control, and drivability. Simultaneously, further research has shown that car sounds exterior to the vehicle are an important factor in orientation, particularly for the blind, the partially sighted, and cyclists (“Fake Engine Noises”, 2008). The sound of a vehicle has become an inherent part of the urban soundscape and it is used to measure distance, speed, and time. In virtual terms, this association to a vehicle’s individual soundscape has new meaning. If, for example, the hybrid car (electric and petrol and very quiet) becomes more prevalent in society, will we change the perceived soundscape of the urban space? For decades, we have associated the sounds of cities with vehicles and they have become a significant part of the urban soundscape, an ambience that defines the metropolis. If this sound disappears, what effect might this have on our relationship to both the city and its transport?
Our Relationship to the Modern Soundscape
Industrialisation has had a major impact on civilisation, and the association of sound to production is seen as implicit. If we introduce noise abatement laws to tackle sound levels, we ignore the relationship that has evolved between humans and the sounds of mechanisation and industrialisation. In our concern for the soundscape and its possible effects on humans we may change our soundscape to create a perceived better sound level or quality, but ultimately we might also change the relationship people now have to cities or industrial centres. It is necessary to fully understand the relationship that groups and individuals have to the urban soundscape, specifically the sounds that are reminders of its urbanity, economy, and population as well as its activities. MacLaran (2003) argues that the urban space is increasingly becoming partitioned and that the individual increasingly tries to locate a private space in which to claim ownership. With geographic boundaries becoming increasingly part of the urban space, defined by economics, politics, and as a reaction to overpopulation, the urban space is increasingly seen as a “mirror of the societies that engender them” (MacLaran, 2003, p. 67). Yet Thompson (1995) suggests that a changing landscape is part and parcel of the urban metropolis: people have adapted and will adapt to further architectural or cultural shifts within urban areas, creating new cultures and social movements that stand alongside these changes to the landscape. What is not considered by these researchers is that a city is more than its visual or geographical cues. Thompson argues that within the media, particularly the internet, new social structures will form within virtual spaces, and these will, to a certain extent, replace the physical world, which is increasingly seen as crowded, in developing community and place.
Yet within mediated environments and the real world there is no real consideration of the soundscape and its importance as a social
construct in the formation of identity and society. There is a substratum of symbolic content associated with the visual space; Schafer’s research has created a set of hermeneutics from which soundscape studies may draw. It is necessary to create a dialectic on the soundscape, one which poses questions of meaning, noise, control, structure, and interpretation. This becomes more significant as urban and governmental policy moves towards controlling sound. If we operate on the basis that sound is a set of objects which can be assessed by their levels rather than their meaning, we will construct passive digital soundscapes. While the study of sound through the social and physical sciences has advanced towards exploring sound as a subject, we are gradually moving towards an acoustic epistemology which embraces the ephemerality of sound. It is both sensorial and primary, a subject which needs fundamental and theoretical frameworks which can be realised through methodological research. Unfortunately, in rushing towards categorising sound and its effects, certain policies have been created to simply categorize sound as noise, not understanding the many social contexts which may explain why, “despite successful implementation of noise maps and action plans…there is little evidence of preventing and reducing environmental noise” (“Working Group Noise Eurocities”, n.d.). These policies fail to recognise that sound has many social contexts: sound is not simply a signifier of some otherness, an association with a producer, or a product or side effect of technology (car sounds, factory sounds, and so on). What this underlines is the need to explore the issue of control which has arisen within soundscape research. If sound is seen as a negative effect of industry and modernism, one which seems beyond the individual’s control, then we have a concept to explore in virtual soundscapes.
The positive act of listening in a virtual soundscape is that the sound can be controlled, be it
through volume or interactive means of changing the sound environment. In the visual world of games certain elements are static and the controller cannot change or affect the environment. This is based on the conceptual approximation of reality (a tree is a tree and must remain so in order to simulate reality). If we introduce ambient sound, it too must approximate this idea: a gamer can close their eyes to shut out the world, but no one can close their ears. But, as in the real world, where we can create or find spaces of acoustic interest to us, in a virtual environment we can turn off an engine; perhaps a gamer should be able to turn off all engines and close down (or destroy) factories and other sources of sound they perceive as unwanted in their soundscape. Equally, the soundscape should simulate reality: the ambient soundscape, whatever that is, must be all-surrounding, and there must be limits to the control of this sound, that is, if the intention is to approximate the physicality of space. I do not propose that we draw attention to the soundscape within games; the more real a soundscape seems, the less a gamer would notice it. Instead we must consider that, to increase the perception of immersion, the soundscape must reflect or approximate a real world soundscape, rather than being a “bit part player to the visual star” (Grimshaw & Schott, 2007, p. 2). Ambient sound denotes a sound that surrounds all physical space; it has been defined by some as foreground, middle ground, and background sound (Adams, 2009; Schafer, 1977). This three-part description of a soundscape lays out sound, within both the virtual and the real world, as an assemblage, one which is created as a result of reverberation, dynamics, levels, and acoustics. These three characteristics imply that sound can be split apart to understand its workings, and then reconstructed as a virtual soundscape, that is, if we ignore how sound is socially and psychologically perceived.
While technology can break sound apart so that we can hear minute elements of the whole, we physically hear sound in its entirety because we cannot shut out sounds; we do not have what
Schafer called “earlids”. We comprehend that sound may be reaching us from particular distances or places, and we make choices in regard to what we consider important sounds to listen to, but we cannot choose to not hear sounds within our hearing range. Equally, we inhabit and work in spaces that produce sounds that we have to make meaning from and that we contribute to; our entire lives are spent surrounded by sounds. So how do we make meaning from these sounds and how do we measure that meaning? If we wish to simulate the experience of being within a space, whether this space is a war zone, a different planet, or the North Pole, we must understand that sound is socially and culturally constructed (Drobnick, 2004). For sound design this is paramount: if we wish to create a simulacrum of the real, we must understand the extent to which sound plays a part in our navigation, both physical and social, of spaces.
IMMERSION AND SIMULATED REALITY
It is the concept of immersion which guides design within the gaming industry, being seen as the “holy grail of digital game design” (Grimshaw, Lindley, & Nacke, 2008). Graphic design in gaming has evolved through several stages of realism, towards the appearance or “illusion of life” (Hodgkinson, 2009, p. 1). One outcome of this simulation of the real world within digital games can be seen in the film industry. Films are produced which have been based on games: Tomb Raider (West, 2001) and Resident Evil (Anderson, 2002). Equally we have movies which resemble gameworlds and the gameworld concept: Final Fantasy (Sakaguchi, 2001), Aeon Flux (Kusama, 2005) and most recently Avatar (Cameron, 2009). The focus of digital visual game design seems aimed towards an essential realism, but why this search for the most realistic? Early games were less concerned with the realism of the space or the characters and more on the idea of game and
competition, for example Space Invaders (Taito, 1978), Pac-Man (Namco, 1980) and Donkey Kong (Nintendo, 1981). Has the goal shifted towards the user having a more connected experience or relationship with the virtual or gameworld? If the space is a simulation of the real world, do we engage less with the concept of a game and more with the concept of being able to relate to the space? Bull and Back (2004) would argue that of the human senses “vision is the most ‘distancing’ one” (p. 4), revealing only what is real and what is. The goal has evolved to create a sense of co-presence within film and potentially games; 3D cinema examines the possibility of the image creating a sense of surround and presence (again, see Avatar and the new 3D TV from Panasonic). The overall assumption seems to be that the only way to create a sense of reality within a digitally created world is through the imagery, a kind of simulated panoptic vision. What seems to be forgotten within this quest for immersion is that sound is actually three dimensional and listening is not a simulated experience.
Sonic Immersion
Sound is inherently physical and we are always immersed in it: even if we focus our listening towards one sonic experience, we are still hearing the entire sonic effect of any space. This, then, is the challenge and the goal for digital game sound designers: to create spaces that accept the whole universality of the ambient space, and to be aware of the outside world that will invariably intrude on this design. Therefore sound design must create a sense of displacement or removal from the real, while accepting that the real will equally intrude on the virtual experience. Similarly, digital game designers must address the issue of the senses being necessary in their entirety to comprehend a world. Surround sound must then play a part within the design of certain game spaces, for example, first-person shooter (FPS) games. FPS games generally involve a single
player navigating through a space; if they are to feel physically immersed, the sound must seem all-surrounding. The need for surround sound or immersive experiences must also take into account the physics of sound. Connor (2004) argues that sound is both intensely corporeal (it physically moves us) and paradoxically immaterial (it cannot be grasped). He argues that sound does not simply surround us; it enters us, and if loud enough or high enough it can cause pain and damage. It is seen as tied to emotion more so than sight, which is seen as neutral. Within social theory sight has overwhelmed the senses; the epistemological status of sight over sound has crossed over to many disciplines, including digital game design. In his 1896 work Sociological Aesthetics (as cited in Frisby, 2002), Simmel argues that vision gave a fuller expression to the fragmented city: the eye, if “adequately trained”, perceives all of a space. This merging of all visual signals suggests that we do not see in parts but in total. Simmel saw sound as intrusive to the perfection of the visible world; it was the profusion of sounds that distracted one from the beauty of the modern urban space. Tonkiss (2004) argues that within modern sociology the goal was to flatten the city, to will sound to silence, to order it. Tonkiss suggests that vision is spectacle, whereas sound is atmosphere, and she argues that sound offers us a sense of depth and perspective.
SOUND METHODOLOGY AND ANALYSIS
In order to identify what is significant about a soundscape one must adopt a multi-method approach. One method is soundwalking, created by Hildegard Westerkamp and Murray Schafer in the 1970s. Westerkamp’s use of this method involved asking participants to move through an area that was known to them and recording places of significance. These recordings would later become part of radio art works or installations.
The soundwalk technique has been adopted by different researchers for numerous projects around the world since the 1970s. Most recently, Adams adopted the soundwalking method for the Positive Soundscapes Project in 2006. The purpose of the research was to develop a holistic approach to studying the soundscape. The project invited people to listen to their soundscape and then identify sounds of importance. Adams adopted Schafer’s terminology of keynote sounds, soundmarks, and sound signals as analytical categories with which to assess the data. This method in itself does not clarify contextual or social meaning, so we must explore other qualitative approaches, such as field research and interviews, and decide which qualitative paradigm will best suit the investigation. Traditional sociological methods should play a part in the exploration of the meaning and construction of sound. In Adams’s research, when “prompted to consider spatial layout” (2009, p. 7), the respondents tried to identify the sounds that they heard in the same way they would objects. This proved problematic, as the participants had no vocabulary to describe the soundscape or its meaning. Simply focusing on identifying sounds and their meaning may limit the explanation or interpretation of cultural or social meaning. Other methods must therefore be incorporated into the exploration of the soundscape to enable the researcher to comprehend the ubiquity of the sound environment. Interviews, both structured and open-ended, allow for the retrieval of information beyond the specifics of description. Adopting a soundwalking method alongside personal narrative interviews or life history interviews can connect meaning to hearing. Allowing participants a longer time to consider their sound environment, for example by having them notate or record it over a period of time, may reveal anamnesis experiences, in which a sound evokes a memory or sensation of a past experience.
This is not as subjective as it may seem: film soundtracks, and the leitmotif in particular, are
often used to refer to a previous part of the film, causing a kind of anamnesis in the listener (Augoyard & Torgue, 2006; Chion, Gorbman, & Murch, 1994). Sounds become tied to experiences and therefore have a meaning beyond a description of sound and effect. Giving our participant a longer time to record or document these kinds of experiences will allow further insight into what certain sounds can trigger. Riessman (1993) argues that in the act of telling there is an inevitable gap between the experience and the telling; the sound methods allow participants to embody themselves in the narrated space, as they are situated in the environment to which they are referring. What these combined methods may reveal lies not in how we listen to sound but in what we hear when we actively think about listening. That in itself may highlight how much active listening happens in a person’s life. If it turns out that quite a lot is heard in an individual’s day-to-day experiences, we must consider sound more actively in the design of digital soundscapes. Conversely, if we find that sound plays only a minor part in a person’s relationship to their environment, we may have to rethink how sound, beyond music, should be part of a digital game space. Sequeira, Specht, Hämäläinen, and Hugdahl’s (2008) research on the hearing impaired noted that clarity is essential in picking up the minutiae within the complexity of sounds, as problems can occur when ambient sound levels are too high. The comprehension of language becomes more difficult when we try to distinguish dialogue that is surrounded by high levels of background sound. Equally, Lumbreras and Sánchez’s (1999) research on the design of digital gameworlds for the blind highlighted the need for 3D audio interfaces as a means of navigating space.
They argue that users, when deprived of the sense of sight, are able to recognise spatiality and “localise specific points in 3D space, concluding that navigating space through sound can be a precise task for blind people” (1999, p. 1).
For digital game sound this may not seem an important issue: the ambient soundscape rarely includes high levels of conversational sound, and game designers rarely design for the blind. Yet in cities and urban centers, vocal sounds and directional sounds are among the dominant sonic and spatial characteristics of the environment. There is interplay between vocal sounds and architecture; voices will resonate at different frequencies depending on the construction of the space. Thus, understanding how people distinguish sounds, such as voices amongst a variety of other sounds, may be relevant if a designer wishes to include this soundtrack of reality in sound design for gaming. Equally, we can choose which direction to take based on acoustic as well as visual information. This could be explored through a series of listening projects in which a focus group listens to different sets of sounds while trying to engage in other activities. If the level of information, and not the volume, is increased over time, one could ascertain how much information we can process simultaneously while trying to complete tasks.
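The listening study proposed above could be prototyped along the following lines. This is only a minimal sketch under stated assumptions: the "level of information" is modelled simply as the number of concurrent sound sources, and "volume" as the summed power of those sources, which are treated as uncorrelated. The names `Stage` and `build_schedule` are hypothetical and do not come from any of the cited studies.

```python
import math
from dataclasses import dataclass


@dataclass
class Stage:
    """One step of the hypothetical listening task."""
    index: int
    n_sources: int          # concurrent sound sources (proxy for information level)
    gain_per_source: float  # linear amplitude gain applied to every source


def build_schedule(n_stages: int, total_power: float = 1.0) -> list[Stage]:
    """Add one source per stage while holding the summed power constant.

    Assumption: sources are uncorrelated, so their powers
    (squared amplitudes) add up.
    """
    schedule = []
    for i in range(n_stages):
        n = i + 1
        gain = math.sqrt(total_power / n)
        schedule.append(Stage(index=i, n_sources=n, gain_per_source=gain))
    return schedule


if __name__ == "__main__":
    for stage in build_schedule(4):
        power = stage.n_sources * stage.gain_per_source ** 2
        print(stage.n_sources, round(stage.gain_per_source, 3), round(power, 3))
```

Holding the summed power fixed is one possible way to separate "more information" from "more volume"; a real study would more likely control perceived loudness, which this sketch does not attempt.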
Contextualizing Game Space
Understanding that there are a variety of ways to experience the gameworld is a necessary condition for deciding what soundscape should or could be placed within this virtual space. What is the operant behaviour of the gamer, what is the participation level, and how much control in the gameworld does the player have? Finally, how does one contextualise oneself within the world? Grimshaw and Schott (2007) noted that there was feedback “for operant behaviours (panting breaths is a good indicator of the player’s energy level)” (p. 475). In examining FPS games, we see that sound is predominantly responsive and reactive, rather than passively situated in the background, and this is a key component of this type of gaming. We may hear the dying groans of another wounded warrior in FPS games, but we
do not hear the voices of hundreds of men dying or in pain, a sound that would exist in a real war. Our experiences of explosions are controlled lest we be deafened, but where is the artillery constantly humming over the horizon, the perpetual whump, whump of helicopters marking or spotting territory? Jørgensen (2008) argues that symbolic sounds are key components in player versus player (PvP) games, more so than background sounds. For her, game context is key: what kind of game is it and what type of space does the avatar inhabit? Jørgensen’s research focuses on a situation-oriented approach, which interprets sounds in reference to events, rather than an object-oriented perspective. She argues that the gamer must understand the rules of the system in order both to manipulate it and to understand that it “can affect individual actions” (Jørgensen, 2008, p. 2). This concept reflects Blumer’s (1986) symbolic interactionist approach, where humans “define each other’s actions instead of merely reacting to each other’s actions” (p. 79). The other person in this case is the gameworld. There may be several schools of thought on sound within gaming. If the sound is too real, would it terrify gamers, distract them, annoy them, or just confuse them? Both Schafer and Smith have looked at the history of the soundscape and analysed the possible cause and effect of certain soundscapes on the human condition (Schafer, 1977; Smith, 1999, 2004). However, a new research model is needed to identify how certain sounds trigger emotive or psychological responses, particularly to the soundscape that is featured in a large number of games: war sounds. For a conclusive multi-method approach we must first decide what is actually needed in a digital game space. For example, if the game has no point of free space where the player can actively listen to their environment, is a detailed soundscape necessary?
This question may be answered by a questionnaire approach; a series of semi-structured interviews may reveal how people hear a space that they only pass through. This type
of interview allows the interviewer a certain level of control which directs the interviewee down particular paths. Equally it allows the interviewee to expand on themes outside the limits of the question, which can reveal unexpected information (Bryman, 2008).
The Mapped Soundscape
If we were to map the soundscape of a city, where would we start? Would we first categorize it under headings from loudest to quietest, or might we break it up into specific human sounds: crowds, individuals, groups of five or more, age-related or gender-specific? Women’s voices have a different tonality from men’s, children have higher-pitched voices than adults, and teenagers are louder than everybody. Then we turn to acoustics: how different do people sound on a pedestrian street compared with a car-filled street or even a park? We can then examine the architecture of the space, the height of the buildings and their position, and how this might change the reverberant space. Then we could move on to city noises, for example trams running through a city. These would sound at a very low but continuous level, marking specific territories within a city at particular times. Then there is the multitude of cars, trucks, and vans, and the occasional house alarm, fire alarm, fire truck, police car, and ambulance sounding off regularly throughout the day, reminding us of sickness, danger, and intrusion. The continuous hum of traffic never quite stops, but it shifts in decibel level throughout the day and sits alongside a cacophony of beeping horns. There is the opening and shutting of thousands of doors onto streets, which might include the hiss of sliding doors, the beeping signals at pedestrian traffic lights, or a robotic voice counting down until we can cross the street. These sounds are part of the ambient soundscape of most cities, but they are still just a small part of the overall sound. Maybe we think we have not heard the sound of a million footsteps pounding a street; it is such a
huge part of the murmur of a city that we no longer distinguish it from the background noise, yet if it stopped we would notice the silence. Then there are the street hawkers and the homeless, with their perpetual cry of “What do you want?”, “Can you give?”, “Have you got any change?”, “Will you buy?”. Specific sound markers in Dublin are “flowers get your flowers, get your fruit, get your veg, paper, evenin’ paper, any money for a hostel”. These oral announcements could also be considered part of the ambient soundtrack of the city. They would in fact be the soundmarkers for particular urban spaces. This multitude of sound still leaves out the sounds related to the outside or inside acoustics created by structures and objects such as buildings, cars, trains, or metro stations. If one moves to what urban dwellers consider the apparently quiet soundscape of the natural world, we find a multitude of sounds connected to the society of animals, from mating cries to hunting calls, as well as the sounds of eating and foraging, flying, climbing, and running. There is the ambient sound of wind through trees, grass or wood bending, rain storms, flowing rivers, rippling water, and small streams, all of it situated in one small area. Now relate this minimal soundscape to sounds within gaming. Such a comparison might lead us to ask how we can experience a real, or significantly close to real, soundscape in a virtual world if the sound design is limited to “character or interface sounds” (Grimshaw & Schott, 2007). This description might be considered too linear and too connected to time and human activities. The ability to comprehend space and the sounds within it is not based entirely on the ability to hear; it is also based on the cultural and social context of both the sounds we hear and our interpretation of them. Blesser and Salter (2009) would argue that we cannot interpret and construct sonic architecture without accepting the cultural relativism of the sensory experience.
Therefore, in my description of the urban and rural soundscape I cannot claim to be objective; my choice of sounds relates to my experience of particular spaces, and my interpretation of these sounds lies in my education, upbringing, and the socially constructed meanings that are inevitably tied to certain sounds. We again return to what Augoyard and Torgue (2006) would consider the inherent problem of describing or analysing a soundscape: the subjectivity issue. If each group or individual perceives sounds differently, how can we generalise when constructing a soundscape? This argument could cross over to many disciplines. Within the arts it is generally understood that a work of art is best understood by the artist who made it, yet the artist accepts that the work will be interpreted differently by every person who sees it. So what makes a great work of art? Is it tied to cultural phenomena? Can a particular work be representative of a particular time? Do people understand the meaning because it resonates with what is happening at a particular moment, globally, politically, and socially? It is not enough to dismiss the study of how the individual experiences sound on the grounds that it is subjective; we must explore how people understand sound in particular places at particular times and then look for similarities with other places and people. Then perhaps we can generalise in the construction of digital sound design on the basis of data that reveal particular generalities.
CONCLUSION
The interpretation and meaning of sound alter in relation to personal, historical, and cultural experiences, as well as the context of our auditory experience. The physicality of sound can alter our perception of the space in which we hear it, expanding or contracting the landscape and shaping our psychological and sociological response to place. If we wish to construct a digital soundscape which simulates reality and creates the sense of immersion, a study of the sociological impact of the soundscape must be undertaken. However, the
consideration of what defines reality and experience must also be explored. As mentioned earlier in the text, the simulated soundscapes of war games are not based on the real soundscape of a war zone, but on a sound designer’s definition of war sounds. What definition of reality are we measuring the soundscapes of virtual worlds against, and how real do we want our virtual environments to be? Most of the environments we experience within games are spaces which we may never experience in reality. Our experience of certain soundscapes may be understood in relation to other media representations: television, Internet, and cinema. The digital game soundscape then becomes a construct of definitions rather than a simulated reality. If we are trying to simulate a sense of reality in gaming, we must consider how real we wish to go. Grimshaw (2007) argues that it is only through the audification of gaming that we actually simulate the idea of immersion. This implies that sound in itself provides a sense of reality whether or not the sound is based on reality. So what is it about the physical aspects of sound that creates a sense of being elsewhere? It is not enough to suggest that because sound is physical it creates a sense of immersion. Sound must be understood beyond the physical; a language must be developed as a result of empirical research which explores the sociological phenomena of sound. Thibaud (1998) suggests that we must create a “praxiology” of sound from the natural soundscape before we construct artificial soundscapes. He also argues that, beyond just meaning and interpretation, sound can and does affect our choices; we pick up “information displayed by the environment in order to control actions (such as locomotion or manipulation) […] thus, the environmental properties and the actor/perceiver activities cannot be disassociated: they shape each other” (Thibaud, 1998, p. 2). Sound can be both active and passive and this will affect our response to it.
Driving a car, for example, might be considered a passive production of sound: we have no choice in the sound the engine makes, but beeping a horn is active sound making. Thus sound production has an implicit message, the interpretation of which might be subjective. Whether it is perceived as positive or negative can depend on the intention. It may also affect behaviour: do we choose to move out of the way of a vehicle, or do we allow it to stimulate anger or other emotive responses? This active sound does not simply reference the acoustics of space or a description of noise; it carries a message, a description of a situation that has social and cultural context. If, as Thibaud (1998) suggests, sound is not a “mere epiphenomenon or secondary consequence of activity” (p. 4), then we must consider that all sound has meaning; it is learning how to deconstruct that meaning that will allow for a clearer understanding of the soundscape. With this understanding we can construct digital soundscapes which will challenge the perception that the image is what gives the illusion of the real.
REFERENCES
Adams, M. (2009). Hearing the city: Reflections on soundwalking. Qualitative Research, 10, 6–9.
Adams, M., Cox, T., Moore, G., Croxford, B., Refaee, M., & Sharples, S. (2006). Sustainable soundscapes: Noise policy and the urban experience. Urban Studies (Edinburgh, Scotland), 43(13), 2385. doi:10.1080/00420980600972504
Anderson, P. W. S. (Director). (2002). Resident evil [Motion picture]. Munich, Germany: Constantin Film.
Augoyard, J., & Torgue, H. (2006). Sonic experience: A guide to everyday sounds (illustrated ed.). Montreal, Canada: McGill-Queen’s University Press.
Bijsterveld, K. (2004). The diabolical symphony of the mechanical age: Technology and symbolism of sound in European and North American noise abatement campaigns, 1900-40. In Bull, M., & Back, L. (Eds.), The auditory culture reader (1st ed., pp. 165–190). Oxford, UK: Berg.
Bijsterveld, K. (2008). Mechanical sound: Technology, culture, and public problems of noise in the twentieth century. Cambridge, MA: MIT Press.
Blesser, B., & Salter, L. (2009). Spaces speak, are you listening?: Experiencing aural architecture. Cambridge, MA: MIT Press.
Blumer, H. (1986). Symbolic interactionism. Berkeley: University of California Press.
Bryman, A. (2008). Social research methods (3rd ed.). Oxford, UK: Oxford University Press.
Bull, M. (2000). Sounding out the city: Personal stereos and the management of everyday life. Oxford, UK: Berg.
Bull, M., & Back, L. (2004). The auditory culture reader (1st ed.). Oxford, UK: Berg.
Cabrera Paz, J., & Schwartz, T. B. M. (2009). Techno-cultural convergence: Wanting to say everything, wanting to watch everything. Popular Communication: The International Journal of Media and Culture, 7(3), 130.
Cameron, J. (Director). (2009). Avatar [Motion picture]. Los Angeles, CA: 20th Century Fox. Lightstorm Entertainment, Dune Entertainment, Ingenious Film Partners [Studio].
Chion, M., Gorbman, C., & Murch, W. (1994). Audio-vision. New York: Columbia University Press.
Cohen, L. (2005). The history of noise [on the 100th anniversary of its birth]. IEEE Signal Processing Magazine, 22(6), 20–45. doi:10.1109/MSP.2005.1550188
Connor, S. (2004). Edison’s teeth: Touching hearing. In V. Erlmann (Ed.), Hearing cultures: Essays on sound, listening, and modernity (English ed., pp. 153–172). Oxford, UK: Berg.
de Certeau, M. (1988). The practice of everyday life. Berkeley: University of California Press.
Donkey kong [Computer game]. (1981). Kyoto: Nintendo.
Drobnick, J. (2004). Aural cultures. Toronto: YYZ Books.
Epstein, M. (2009). Growing an interdisciplinary hybrid: The case of acoustic ecology. History of Intellectual Culture, 3(1). Retrieved December 29, 2009, from http://www.ucalgary.ca/hic/issues/vol3/9.
Fake engine noises added to hybrid and electric cars to improve safety. (2008). Retrieved January 10, 2010, from http://www.switched.com/2008/06/05/fake-engine-noises-added-tohybrid-and-electric-cars-to-improve/.
Feld, S. (2004). A rainforest acoustemology. In Bull, M., & Back, L. (Eds.), The auditory culture reader (1st ed., pp. 223–240). Oxford, UK: Berg Publishers.
Frisby, D. (2002). Cityscapes of modernity: Critical explorations. Cambridge, UK: Polity.
Goffman, E. (1959). The presentation of self in everyday life (1st ed.). New York: Anchor.
Grimshaw, M. (2007). Sound and immersion in the first-person shooter. In Proceedings of 11th International Conference on Computer Games: AI, Animation, Mobile, Educational and Serious Games. Published to CDROM.
Grimshaw, M., Lindley, C. A., & Nacke, L. (2008). Sound and immersion in the first-person shooter: Mixed measurement of the player’s sonic experience. In Proceedings of Audio Mostly Conference 21-26.
Grimshaw, M., & Schott, G. (2007). Situating gaming as a sonic experience: The acoustic ecology of first-person shooters. In Proceedings of Situated Play, 24-28.
Hodgkinson, G. (2009). The seduction of realism. In Proceedings of ACM SIGGRAPH ASIA 2009 Educators Program (pp. 1–4). Yokohama, Japan: The Association for Computing Machinery.
Jennings, P. (2009). WMG: Professor Paul Jennings. Retrieved December 30, 2009, from http://www2.warwick.ac.uk/fac/sci/wmg/about/people/profiles/paj/.
Jørgensen, K. (2008). Audio and gameplay: An analysis of PvP battlegrounds in World of Warcraft. GameStudies. Retrieved January 10, 2010, from http://gamestudies.org/0802/articles/jorgensen.
Kusama, K. (Director). (2005). Aeon flux [Motion picture]. Hollywood, CA: Paramount.
Lefebvre, H. (2004). Rhythmanalysis: Space, time and everyday life. Continuum.
Lumbreras, M., & Sánchez, J. (1999). Interactive 3D sound hyperstories for blind children. In Proceedings of the SIGCHI conference on Human factors in computing systems: the CHI is the limit (pp. 318–325). Pittsburgh, PA: ACM.
MacLaran, A. (2003). Making space: Property development and urban planning. London: Hodder Arnold.
Manning, P. (1992). Erving Goffman and modern sociology. Stanford, CA: Stanford University Press.
Pac man [Computer game]. (1980). Tokyo, Japan: Namco.
Riessman, C. K. (1993). Narrative analysis (1st ed.). Los Angeles: Sage.
Russolo, L. (1913). Russolo: The art of noises. Retrieved December 30, 2009, from http://120years.net/machines/futurist/art_of_noise.html.
Sakaguchi, H. (Director). (2001). Final fantasy [Motion picture]. Los Angeles: Columbia.
Schafer, R. M. (1977). The tuning of the world. Toronto: McClelland and Stewart.
Sequeira, S. D. S., Specht, K., Hämäläinen, H., & Hugdahl, K. (2008). The effects of different intensity levels of background noise on dichotic listening to consonant-vowel syllables. Scandinavian Journal of Psychology, 49(4), 305–310. doi:10.1111/j.1467-9450.2008.00664.x
Simmel, G. (1979). The metropolis and mental life. Retrieved February 1, 2010, from http://www.blackwellpublishing.com/content/BPL_Images/Content_store/Sample_chapter/0631225137/Bridge.pdf.
Smith, B. R. (1999). The acoustic world of early modern England: Attending to the o-factor (1st ed.). Chicago: University of Chicago Press.
Smith, B. R. (2004). Tuning into London c.1600. In Bull, M., & Back, L. (Eds.), The auditory culture reader (1st ed., pp. 127–136). Oxford, UK: Berg.
Space invaders [Computer game]. (1978). Tokyo, Japan: Taito.
Thibaud, J. (1998). The acoustic embodiment of social practice: Towards a praxiology of sound environment. In Karlsson, H. (Ed.), Proceedings of Stockholm, Hey Listen! (pp. 17–22). Stockholm: The Royal Swedish Academy of Music.
Thompson, J. B. (1995). The media and modernity. Stanford, CA: Stanford University Press.
Tonkiss, F. (2004). Aural postcards: Sound, memory and the city. In Bull, M., & Back, L. (Eds.), The auditory culture reader (1st ed., pp. 303–310). Oxford, UK: Berg.
West, S. (Director). (2001). Lara Croft: Tomb raider [Motion picture]. Hollywood, CA: Paramount.
Working Group Noise Eurocities. (n.d.). Retrieved January 10, 2010, from http://workinggroupnoise.web-log.nl/.
KEY TERMS AND DEFINITIONS
Holistic: In order to understand the whole of a system, one must look at the parts within it that make it up. Within sociology, Durkheim developed a concept of holism which is in opposition to methodological individualism.
Immersion: To be completely surrounded by sound.
Mediatization: Sonia Livingstone’s definition of mediatization is for me the most accurate because it refers “to the meta process by which every day practices and social relations are increasingly shaped by mediating technologies and media organisations” (http://www.icahdq.org/conferences/presaddress.asp par. 3).
Schizophonic: Murray Schafer describes the term schizophonic as the split between an original sound and an electroacoustic reproduction in a soundscape. I am using it as a metaphor for a split between two types of listening spaces: if one is listening to music while traversing a real space, one’s attention is split between the real-world space and the virtual soundscape.
Social Construction of Space: Social constructivists examine ways in which individuals and groups participate in the creation of their perceived social reality. In this context, I am focusing on how a society can change its perceived space through sound, either by how its members listen to sound or by how they produce sound in a space.
Sonic Architecture: The study of the acoustic effect of objects such as buildings, interior and exterior, on space. Equally, sonic architecture explores how people can construct sonic structures or challenge the sounds of places by creating their own sonic space.
Soundscape: Refers to both natural and man-made sounds that immerse an environment.
Soundwalking: A soundwalk is a journey where the objective is to discover an environment by listening to it.
Symbolic Interactionist: The study of micro-scale social interaction. It is seen as a process that informs and forms human conduct, the premise being that human beings act on and upon things based on the meaning these things have, things being defined as physical objects such as chairs, trees, and phones, and human beings such as mothers, shop clerks, and so forth.
Chapter 4
Diegetic Music: New Interactive Experiences
Axel Berndt
Otto-von-Guericke University, Germany
ABSTRACT
Music which is performed within the scene is called diegetic. In practical and theoretical literature on music in audio-visual media, diegetic music is usually treated as a side issue, a sound effect-like occurrence, just a prop of the soundscape that sounds like music. A detailed consideration reveals a lot more. The aim of this chapter is to uncover the abundance of diegetic occurrences of music, the variety of functions they fulfill, and issues of their implementation. The role of diegetic music gains importance in interactive media as the medium allows a nonlinearity and controllability as never before. As a diegetic manifestation, music can be experienced in a way that was previously unthinkable except, perhaps, for musicians.
INTRODUCTION
Dealing with music in audio-visual media traditionally leads the researcher first to its non-diegetic occurrence, that is, offstage music. Its interplay with the visuals and its special perceptual circumstances have been extensively explored and analyzed by practitioners, musicologists, and psychologists. Its role is mostly an accompanying, annotating one that emotionalises elements of the plot or
DOI: 10.4018/978-1-61692-828-5.ch004
scene, associates contextual information, and thus enhances understanding (Wingstedt, 2008). Comparatively little attention has been given to diegetic music. As its source is part of the scene’s interior (for example, a performing musician, a music box, a car radio), it is audible from within the scene. Hence, it can exert an influence on the plot and acting and is frequently even an inherent part of the scenic action. In interactive media it can even become an object the user might be able to directly interact with. This chapter addresses the practical and aesthetic issues of diegetic music. It clarifies
Copyright © 2011, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
Figure 1. A systematic overview of all forms of diegetic music
differences from non-diegetic music regarding inner-musical properties, its functional use, and its staging and implementation. Particular attention is paid to interactivity aspects that hold a variety of new opportunities and challenges in store, especially in the context of modern computer games technology. This leads directly to concrete design guidelines. These show that adequate staging of diegetic music requires more than its playback. The problem area comprises, amongst others, the simulation of room acoustics and sound radiation, the generation of expressive performances of given compositional material, and even its creation and variation in real time. The complexity and breadth of these issues might discourage developers. The effort seems too expensive for a commercial product and is rarely invested. Game development companies usually have no resources available to conduct research in any of these fields. But in most cases, this is not even necessary. Previous and recent research in audio signal processing and computer music has created many tools, algorithms, and systems. Even if not developed for the particular circumstances of diegetic music, they approach or even solve similar problems. It is a further aim of this chapter to uncover this untapped potential. This may inspire developers to make new user experiences possible, beyond the limitations of an excluded passive listener.
The key to this is interactivity. However, different types of games allow different modes of interaction. Different approaches to diegetic music follow, accordingly. To lay a solid conceptual basis, this chapter also introduces a more differentiating typology of diegetic music and its subspecies, which is outlined in Figure 1. The respective sections expand on the different types. Before that, a brief historical background and a clarification of the terminology used are provided.
Where Does It Come From?
Early examples of diegetic music can be found in classic theatre and opera works, for instance, the ball music in the finale of W.A. Mozart’s Don Giovanni (KV 527, premiered in 1787) which is performed onstage, not from the orchestra pit. Placing musicians onstage next to the actors may hamper dialog comprehensibility. To prevent such conflicts, diegetic music was often used as a foreground element that replaces speech. It wasn’t until radio plays and sound films offered more flexible mixing possibilities that diegetic music grew to be more relevant for background soundscape design (for example, bar music, street musicians). Such background features could now be set on a significantly lower sound level to facilitate focusing the audience’s attention on the spoken text, comparable to the well-known Cocktail Party Effect (Arons, 1992).
A further form of occurrence evolved in the context of music-based computer games, having its origins in the aesthetics of music video clips: music that is visualized on screen. In this scenario, the virtual scene is literally built up through music. Musical features define two- or three-dimensional objects and their positioning, and set event qualities (for example, bass drum beats may induce big obstacles on a racing track or timbral changes cause transitions of the color scheme). The visualizations are usually of an aesthetically stylized type. Thus, the scenes are hardly (photo-)realistic but rather surrealistic. Typical representatives of music-based computer games are Audiosurf: Ride Your Music (Fitterer, 2008), Vib-Ribbon (NanaOnSha, 1999), and Amplitude (Harmonix, 2003). However, music does not have to be completely precomposed for the interactive context. Games like Rez (Sega, 2001) demonstrate that player interaction can serve as a trigger of musical events. Playing the game creates its music. One could argue that this is rather a very reactive non-diegetic score. However, the direct and very conspicuous correlation of interaction and musical event and the entire absence of any further sound effects drag the music out of the “off” onto the stage. The surrealistic visuals emphasize this effect as they decrease the aesthetic distance to musical structure. In this virtual world, music is the sound effect and is, of course, audible from within the scene, hence diegetic. The conceptual distance to virtual instruments is small, as is shown by the game Electroplankton (Iwai, 2005) and the lively discussion on whether it can still be called a game (Herber, 2006). In the context of Jørgensen’s (2011) terminology discussion, a more precise clarification of the use of diegetic and non-diegetic in this chapter is necessary.
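The idea that musical features define objects and their positioning can be illustrated with a small example. The following is a minimal sketch and not the method of any of the games named above: it assumes that beat events have already been extracted from the music as (time, drum) pairs and that the player moves along the track at a constant speed. The names `Obstacle` and `place_obstacles` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Obstacle:
    position: float  # distance along the track, in arbitrary track units
    size: str        # "big" for bass drum hits, "small" otherwise


def place_obstacles(beats, speed):
    """Map (time_in_seconds, drum_name) events to obstacles on a track.

    Assumption: the player moves at a constant `speed` (units per second),
    so an event at time t is reached at position speed * t.
    """
    obstacles = []
    for time_s, drum in beats:
        size = "big" if drum == "bass" else "small"
        obstacles.append(Obstacle(position=speed * time_s, size=size))
    return obstacles


if __name__ == "__main__":
    events = [(0.0, "bass"), (0.5, "hihat"), (1.0, "bass")]
    for obstacle in place_obstacles(events, speed=20.0):
        print(obstacle)
```

Because positions are derived from event times, the player reaches each obstacle at the moment the corresponding beat sounds, which is what ties the visual scene to the musical structure in this simplified model.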
The diegesis, mostly seen as a fictional story world, is here used in its more general sense as a virtual or fictional world detached from the conventional story component. It is rather the domain the user interacts with either directly (god-like) or through an avatar which itself is part
of the diegesis. The diegesis does not necessarily have to simulate real world circumstances. The later discussion on music video games¹ will show that it does not have to be visual either, even if visually presented. Again, the diegesis in interactive media is the ultimate interaction domain, not any interposed interface layer. Keyboard, mouse, gamepad, and graphical user interface elements like health indicator and action buttons are extradiegetic. They serve only to convert user input into diegetic actions or to depict certain diegetic information.

The terms diegetic and non-diegetic in their narrow sense describe the source domain of a described entity: diegesis or extra-diegesis. Diegetic sound comes from a source within the diegesis. Many theorists add further meaning to the terms regarding, for instance, the addressee. A soldier in a strategy game may ask the player directly where to go. As the player may also adapt his playing behaviour to non-diegetic information (a musical cue warns of upcoming danger), such information can influence the diegesis. Such domain-crossing effects are unthinkable in linear, that is non-interactive, media. The strict inside-outside separation of the traditional terminology is, of course, incapable of capturing these situations and may never have been meant to do so. Galloway (2006) deals with this subject in an exemplary way. This chapter does not intend to participate in this discussion.

For the sake of clarity, the narrow sense of the terminology is applied in this chapter. This means that the terms only refer to the source domain, not the range of influence. Diegetic is what the mechanics of the diegesis (world simulation, in a sense) create or output. If the superior game mechanics produce further output (for example, interface sounds or the musical score) it is declared non- or extra-diegetic.
This is also closer to the principles of the technical implementation of computer games and may make the following explanations more beneficial.
ONSTAGE PERFORMED MUSIC

The primal manifestation of diegetic music is music that is performed within the scene, either as a foreground or background artifact. As such, it usually appears in its autonomous form as a self-contained and very often a pre-existent piece. The most distinctive difference between diegetic music and its non-diegetic counterpart is that the latter cannot be considered apart from its visual and narrative context. Likewise, the perceptual attitude differs substantially. Foreground diegetic music is perceived very consciously, comparable to listening to a piece of music on the radio or a concert performance. Even background diegetic music that serves a similar purpose as non-diegetic mood music is comprehended differently. While mood music describes an inner condition (What does a location feel like?) background diegetic music contributes to the external description (What does the location sound like?) and can be mood-influential only on a general informal level (They are playing sad music here!).
Functions

Therefore, the role of background diegetic music is often regarded as less intrinsic. It is just a prop, an element of the soundscape, which gives more authenticity to the scenario on stage. As such it serves well to stage discos, bars, cafés, street settings with musicians, casinos (see Collins, Tessler, Harrigan, Dixon, & Fugelsang, 2011, for an extensive description of sound and music in gambling environments), and so forth. However, it does not have to remain neutral, even as a background element. It represents the state of the environment. Imagine a situation where the street musicians suddenly stop playing. This is more than an abrupt change of the background atmosphere: it is a signal indicating that something happened that stopped them playing, that something has fundamentally changed.
Conversely, dramatic events may happen (the protagonist may be attacked, for instance) while the musical background does not react. Instead, it may continue playing jaunty melodies. Such an indifferent relation between foreground and background evokes a kind of incongruence. This emphasizes the dramaturgical meaning of the event or action. Moreover, it is sometimes understood as a philosophical statement indicating an indifferent attitude of the environment. Whatever happens there, it means nothing to the rest of the world: “life goes on” (Lissa, 1965, p.166).

Even though the source of diegetic music is part of the scene it does not have to be visible. The sound of a gramophone suffices to indicate its presence. In this way diegetic music, just like diegetic sound effects, gathers in non-visible elements of the scene and blurs the picture frame, which is particularly interesting for fixed-camera shots. It evokes a world outside the window and beyond that door which never opens. Its role as a carrier of such associations takes shape the more music comes to the fore because the linkage to its visual or narrative correlative is very direct and conspicuous (The guy who always hums that melody!).

Furthermore, when diegetic music is performed by actors, and thereby linked to them, it can become a means of emotional expression revealing their innermost condition. The actor can whistle a bright melody, hum it absentmindedly while doing something else, or articulate it with sighing inflection. Trained musicians can even change the mode (major, minor), vary the melody, or improvise on it.

The more diegetic music becomes a central element of the plot the more its staging gains in importance. Did the singer act well to the music? Does the fingering of the piano player align with the music? It can become a regulator for motion and acting. The most obvious example is probably a dancing couple.
Very prominent is also the final assassination scene in Alfred Hitchcock’s (1956) The Man Who Knew Too Much. During a concert
performance of Arthur Benjamin’s Cantata Storm Clouds the assassin tries to cover his noise by shooting in synch with a loud climactic cymbal crash. Even screaming Doris Day is perfectly in time with the meter of the orchestra.
Design Principles

However, when a musical piece is entirely performed in the foreground, it creates a problem. It slows the narrative tempo down. This is because change processes take more time in autonomous music than on the visual layer, in films as well as in games. In contrast to non-diegetic music, where changes are provoked and justified by the visual and narrative context, diegetic music has to stand on its own. Its musical structure has to be self-contained, hence, change processes need to be more elaborate. Such compositional aspects of non-diegetic film music and its differences to autonomous music have been discussed already by Adorno and Eisler (1947).

For an adequate implementation of diegetic music, further issues have to be addressed. In contrast to non-diegetic music, it is subject to the acoustic conditions of the diegesis. A big church hall, a small bedroom, or an outdoor scene in the woods: each environment has its own acoustics and resonances. Ever heard disco music from outside the building? The walls usually filter medium and high frequencies; the bass is left. This changes completely when entering the dance floor. Diegetic music as well as any other sound effect cannot, and must not, sound like a perfectly recorded and mixed studio production. A solo flute in a large symphony orchestra is always audible on CD but gets drowned in a real-life performance. According to the underlying sound design there might, nonetheless, be a distinction between foreground and background mixing that does not have to be purely realistic. Further discussion of this can be found, for instance, in Ekman (2009).

The sound positioning in the stereo or surround panorama also differs from that of studio recordings. Diegetic music should come from where it is performed. The human listener is able to localize real world sound sources with deviations down to two degrees (Fastl & Zwicker, 2007). Depending on the speaker setting, this can be significantly worse for virtual environments. But even stereo speakers provide rough directional information. Localization gets better again when the source is moving or the players are able to change their relative position and orientation to the source. In either case the source should not “lose” its sound or leave it behind when it moves. It would, as a consequence, lose presence and believability. Positioning the music at the performer’s location in relation to the listener is as essential as it is for every further sound effect.

But up to now only a very primitive kind of localization has been discussed: setting the sound source at the right place. In interactive environments, the player might be able to come very close to the performer(s). If it is just a little clock radio, a single sound source may suffice. But imagine a group of musicians, a whole orchestra, the player being able to walk between them, listening to each instrument at close range. Not to forget that the performer, let us say a trumpet player, would sound very different at the front than from behind, at least in reality. Each instrument has its individual sound radiation angles. These are distinctively pronounced for each frequency band. The radiation of high-frequency partials differs from that of medium and low frequencies, a fact that, for instance, sound engineers have to consider when placing microphones (Meyer, 2009).

How far do developers and designers need to go? How much realism is necessary? The answer is given by the overall realism that the developers aim for. Non-realistic two-dimensional environments (cartoon style, for example) are comparatively tolerant of auditory inconsistencies.
Even visually (photo-)realistic environments do not necessarily call for realistic soundscapes. Hollywood cinematic aesthetics, for instance, focus on the affect, not on realism. Ekman (2009) describes further situations where the human subjective auditive perception differs greatly from the actual physical situation. Possible causes can be the listener’s attention, stress, auditory acuity, body sounds and resonances, hallucination and so forth.

All this indicates that diegetic music has to be handled on the same layer as sound effects and definitely not on the “traditional” non-diegetic music layer. In the gaming scenario, it falls under the responsibility of the audio engine that renders the scene’s soundscape. Audio Application Programming Interfaces (APIs) currently in use are, for instance, OpenAL (Loki & Creative, 2009), DirectSound as part of DirectX (Microsoft, 2009), FMOD Ex (Firelight, 2009), and AM3D (AM3D, 2009). An approach to sound rendering based on graphics hardware is described by Röber, Kaminski, and Masuch (2007) and Röber (2008). A further audio API that is especially designed for the needs of mobile devices is PAudioDSP by Stockmann (2007).

It is not enough, though, to play the music back with the right acoustics, panorama, and filtering effects. Along the lines of “more real than reality”, it is often a good idea to reinforce the live impression by including a certain degree of defectiveness. The wow and flutter of a record player may cause pitch bending effects. There can be interference with the radio reception resulting in crackling and static noise. Not to mention the irksome things that happen to each musician, even to professionals, at live performances: fluctuation of intonation, asynchrony in ensemble play, and wrong notes, to name just a few of them. Those things hardly ever happen on CD. In the recording studio, musicians can repeat a piece again and again until one perfect version comes out or enough material is recorded to cut together a perfect version during postproduction. But at live performances all this happens and cannot be corrected afterwards. Including such imperfections in the performance of diegetic music makes for a more authentic live impression.
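As a rough illustration of handling diegetic music like any other positioned sound effect, the following sketch derives left and right channel gains from the position of the source relative to the listener and applies a one-pole low-pass filter to imitate music heard through a wall. It is a minimal sketch under stated assumptions: the function names, the constant-power pan law, the inverse-distance rolloff, and the default values are illustrative choices and are not taken from any of the audio APIs named above.

```python
import math

def stereo_gains(listener_xy, heading, source_xy, ref_distance=1.0):
    """Return (left, right) gains for a source in a 2D scene.

    heading is the listener's viewing direction in radians, measured
    counterclockwise from the +x axis (an assumed convention).
    Front and back are not distinguished in this simple stereo model.
    """
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    distance = math.hypot(dx, dy)
    relative = math.atan2(dy, dx) - heading
    pan = -math.sin(relative)              # -1 = hard left, +1 = hard right
    angle = (pan + 1.0) * math.pi / 4.0    # constant-power pan law
    attenuation = ref_distance / max(distance, ref_distance)
    return attenuation * math.cos(angle), attenuation * math.sin(angle)

def one_pole_lowpass(samples, cutoff_hz, sample_rate=44100):
    """Dampen medium and high frequencies, as a wall would (very roughly)."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    state = 0.0
    filtered = []
    for sample in samples:
        state += alpha * (sample - state)
        filtered.append(state)
    return filtered
```

A renderer could recompute the gains whenever the listener or the source moves, and lower the cutoff while a wall lies between them.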
Non-Linearity and Interactivity

However, in the gaming context in particular this authenticity gets lost when the player listens to the same piece more than once. A typical situation in a game: The player re-enters the scene several times and the diegetic music always starts with the same piece as if the performers paused and waited until the player came back. This can be experienced, for example, in the adventure game Gabriel Knight: Sins of the Fathers (Sierra, 1993) when walking around in Jackson Square. Such a déjà vu effect robs the virtual world of credibility. The performers, even if not audible, must continue playing their music and when the player returns he must have missed parts of it.

Another very common situation where the player rehears a piece of music occurs when getting stuck in a scene for a certain time. The performers, however, play one and the same piece over and over again. In some games they start again when they reach the end, in others, the music loops seamlessly. Both are problematic because it becomes evident that there is no more music. The end of the world is reached in some way and there is nothing beyond. A possible solution could be to extend the corpus of available pieces and go through it either successively or randomly in the manner of a jukebox.

But the pieces can still recur multiple times. In these cases it is important that the performances are not exactly identical. A radio transmission does not always crackle at the same time within the piece and musicians try to give a better performance with each attempt. They focus on the mistakes they made last time and make new ones instead. This means that the game has to generate ever new performances. Examples of systems that can generate expressive performances are:

• the rule-based KTH Director Musices by Friberg, Bresin, and Sundberg (2006)
• the machine learning-based YQX by Flossmann, Grachten, and Widmer (2009)
• the mathematical music theory-based approach by Mazzola, Göller, and Müller (2002).
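The requirement stated earlier in this section, that performers keep playing while the player is away, can be approximated without simulating anything in the player's absence: the playback position is derived from a world clock whenever the player (re-)enters the scene. The following is a minimal sketch; the function name `playlist_position` and the assumption of a fixed, cyclically repeated list of piece durations are hypothetical.

```python
def playlist_position(world_time, durations, start_time=0.0):
    """Return (piece_index, offset_seconds) at the given world time.

    durations lists the length of each piece in seconds. The playlist is
    assumed to have started at start_time and to repeat cyclically,
    whether or not anybody is listening.
    """
    total = sum(durations)
    if total <= 0:
        raise ValueError("playlist must have a positive total duration")
    elapsed = (world_time - start_time) % total
    for index, duration in enumerate(durations):
        if elapsed < duration:
            return index, elapsed
        elapsed -= duration
    # Guard against floating-point rounding at the very end of the cycle.
    return len(durations) - 1, durations[-1]
```

With three pieces of 180, 240, and 200 seconds, a player returning at world time 200 would join the second piece 20 seconds in rather than hearing the first piece start again.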
Even the expressivity of the performance itself can be varied. This can derive from the scene context (the musician is happy, bored, or sad) or be affected by random deviations (just do it differently next time). Systems to adapt performative expression were developed by Livingstone (2008) and Berndt and Theisel (2008).

But modifying performative expression is not the only way to introduce diversity into music. A further idea is to exploit the potential of sequential order, that is, to rearrange the sequence of musical segments. The idea derives from the classic musical dice games which were originally invented by Kirnberger (1767) and became popular through Mozart (1787). The concept can be extended by so-called One Shot segments that can be interposed occasionally amongst the regular sequence of musical segments as proposed within several research prototypes by Tobler (2004) and Berndt, Hartmann, Röber, and Masuch (2006). These make the musical progress appear less fixed.

Musical polyphony offers further potential for variance: Building block music² allows various part settings as not all of them have to play at once. One and the same composition can sound very different by changing the instrumentation (Adler, 2002; Sevsay, 2005) or even the melodic material and counterpoint (Aav, 2005; Berndt et al., 2006; Berndt, 2008). Thus, each iteration seems to be a rearrangement or a variation instead of an exact repetition.

Generative techniques can expand the musical variance even more. Imagine a virtual jazz band that improvises all the time. New music is constantly created without any repetition. This can be based on a given musical material, a melody for instance, that is varied. The GenJam system, a genetic approach (Miranda & Biles, 2007), is a well known representative. MeloNet and JazzNet are two systems that create melody ornamentations through trained neural networks (Hörnel, 2000; Hörnel & Menzel, 1999). Based on a graph representation of possible alternative chord progressions (a Hidden Markov Model derivative called Cadence Graph), Stenzel (2005) describes an approach to variations on the harmonic level.

Beyond varying musical material it is also possible to generate ever new material. Hiller and Isaacson (1959) already attempted this through the application of random number generators and Markov chains. This is still common practice today, for example, for melody generation (Klinger & Rudolph, 2006). Next, harmonization and counterpoint can be created for that melody to achieve a full polyphonic setting (Ebcioglu, 1992; Schottstaedt, 1989; Verbiest, Cornelis, & Saeys, 2009). Further approaches to music composition are described by Löthe (2003), Taube (2004), and Pozzati (2009). Papadopoulos and Wiggins (1999) and Pachet and Roy (2001) give more detailed surveys of algorithmic music generation techniques.

The discussion of the nonlinear aspects of diegetic music has so far omitted one fact that comes along with interactive media. Music, as part of the diegesis, not only influences it but can also be influenced by it, especially by the player. Which player is not tempted to click on the performer and see what happens? In the simplest case a radio is just switched on and off or a song is selected on the jukebox. Interaction with virtual musicians, by contrast, is more complicated. Two modes can be distinguished: the destructive and the constructive mode.

Destructive interaction interferes with the musician’s performance. The player may talk to him, jostle him, distract his attention from playing the right notes and from synchronisation with the ensemble. This may even force the musician to stop playing. Destructive interaction affects the musical quality. A simple way to introduce wrong notes is to change the pitch of some notes by a certain interval.
Of course, not all of them have to be changed. The number of changes depends on the
degree of disruption. The same applies to the size of the pitch interval: small errors shift a note to its diatonic neighbor (a half or whole step), and the intervals grow the more the musician is distracted. In the same way rhythmic precision and synchrony can be manipulated. Making musicians asynchronous simply means adding a plain delay that puts some of them ahead and others behind in the ensemble play. The rhythmic precision, by contrast, has to do with the timing of a musician. Does he play properly in time or is he “stumbling”, in other words, unrhythmical? Such timing aspects were described, investigated, and implemented by Friberg et al. (2006) and Berndt and Hähnel (2009) amongst others. As ensemble play is also a form of communication between musicians, one inaccurate player affects the whole ensemble, beginning with the direct neighbor. They will, of course, try to come together again, which can be emulated by homeostatic (self-balancing) systems. Such self-regulating processes were, for instance, described by Eldridge (2002) and used for serial music composition.

Constructive interaction, by contrast, influences musical structure. Imagine a jazz band cheered by the audience, encouraged to try more adventurous improvisations. Imagine a street musician playing some depressing music: given a coin, he becomes cheerful, and so does his music. Such effects can rarely be found in virtual gaming worlds up to now. The adventure game Monkey Island 3: The Curse of Monkey Island (LucasArts, 1997) features one of the most famous and visionary exceptions. In one scene the player’s pirate crew sings the song “A Pirate I Was Meant To Be.” The player chooses the keywords with which the next verse has to rhyme. The task is to select the one that nobody finds a rhyme for, to bring them back to work. The sequential order of verses and interludes is adapted according to the multiple-choice decisions that the player makes.
A systematic overview of this and further approaches to nonlinear music is given by Berndt (2009).
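The destructive manipulations described above (wrong notes, unsteady timing, and ensemble asynchrony) can be sketched in a few lines. This is a minimal sketch and not the method of any of the cited systems: notes are assumed to be `(onset_seconds, midi_pitch)` tuples, the `distraction` parameter in the range 0 to 1 is hypothetical, errors are measured in semitones rather than diatonic steps, and the mapping from distraction to error rate, interval size, and jitter is an arbitrary illustrative choice.

```python
import random

def disturb_performance(notes, distraction, rng,
                        max_error_rate=0.5, max_jitter=0.08):
    """Return a copy of notes with wrong pitches and unsteady timing.

    distraction = 0.0 leaves the part untouched; 1.0 is the strongest
    disruption this sketch models.
    """
    if not 0.0 <= distraction <= 1.0:
        raise ValueError("distraction must lie between 0 and 1")
    # Small distraction: neighboring pitches only; larger: bigger leaps.
    largest_interval = 1 + round(distraction * 6)
    disturbed = []
    for onset, pitch in notes:
        if rng.random() < distraction * max_error_rate:
            step = rng.randint(1, largest_interval)
            pitch += rng.choice((-1, 1)) * step
        onset += rng.uniform(-1.0, 1.0) * distraction * max_jitter
        disturbed.append((max(0.0, onset), pitch))
    return disturbed

def desynchronise(parts, offsets):
    """Shift each musician's part by a constant delay in seconds.

    Positive offsets put a musician behind, negative ones ahead.
    """
    return [[(onset + offset, pitch) for onset, pitch in part]
            for part, offset in zip(parts, offsets)]
```

Passing a seeded `random.Random` instance makes a disturbed performance reproducible, while a fresh seed per visit yields a new set of mistakes each time.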
So much effort, such a large and complex arsenal for mostly subsidiary background music? Do we really require all this? The answer is “no”. This section proposed a collection of tools of which one or another can be useful for rounding off the coherence of the staging and for strengthening the believability of the music performance. Moreover, these tools establish the necessary foundations for music to be more than a background prop and to come to the fore as an interactive element of the scene. This opens up the unique opportunity for the player to experience music and its performance in a completely different way, namely close up.
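One of the simpler tools from this collection, the rearrangement of musical segments in the manner of the musical dice games with occasionally interposed One Shot segments, might look as follows. This is only a sketch under stated assumptions: segments are represented by arbitrary labels, and the function name, the probability parameter, and the rule that each One Shot segment is used at most once per run are hypothetical choices rather than details of the cited prototypes.

```python
import random

def dice_game_sequence(alternatives, one_shots, rng, one_shot_probability=0.2):
    """Assemble one run through a piece from interchangeable segments.

    alternatives[i] holds the segments that may fill position i.
    one_shots are extra segments that may be interposed after a regular
    segment, each at most once per run.
    """
    remaining = list(one_shots)
    sequence = []
    for options in alternatives:
        sequence.append(rng.choice(options))
        if remaining and rng.random() < one_shot_probability:
            sequence.append(remaining.pop(rng.randrange(len(remaining))))
    return sequence
```

Calling the function again with a different random state produces another ordering of the same material, so that a repeated piece is less likely to sound identical.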
VISUALIZED MUSIC

Beyond visualizing only the performance of music, that is showing performing musicians or sound sources as discussed so far, there is a further possibility: the visualization of music itself. In fact, it is not music as a whole that is visualized but rather a selection of structural features of a musical composition (rhythmic patterns, melodic contour and so on). Moreover, the visual scene need not be completely generated from musical information. Music video games just like music video clips often feature a collage-like combination of realistic and aesthetically stylized visuals. The latter is the focus of this section.

The Guitar Hero series (Harmonix, 2006-2009) works with such collage-like combinations. While a concert performance is shown in the background, the foreground illustrates the guitar riffs which the player has to perform. PaRappa the Rapper (NanaOn-Sha, 1996) also shows the performers on screen and an unobtrusive line of symbols on top that indicates the type of interaction (which keys to press) and the timing to keep up with the music. In Audiosurf, by contrast, the whole scene is built up through music: the routing of the obstacle course, the positioning of obstacles and items, the color scheme, background objects, and visual effects, even the driving speed. So music
not only sets visual circumstances but also event qualities. Some pieces induce more difficult tracks than others.
The Musical Diegesis

The visual instances of musical features are aesthetically looser in video clips. In the gaming scenario they have to convey enough information to put the game mechanics across to the player. Hence, they have to be aesthetically more consistent and presented in a well-structured way. Often a variant of the pitch-time notation, known from conventional music scores (pitch is aligned vertically, time horizontally), forms the conceptual basis of the illustrations. Upcoming events scroll from right to left. Their vertical position indicates a qualitative value (not necessarily pitch) of the event. The orientation can, of course, vary. Shultz (2008) distinguishes three modes:

• Reading Mode: corresponds to score notation as previously described and implemented, for example, in Donkey Konga (Namco, 2003)
• Falling Mode: the time-axis is vertically oriented, the pitch/quality-axis horizontally, upcoming events “drop down” (Dance Dance Revolution by Konami (1998))
• Driving Mode: just like falling mode but with the time-axis in z-direction (depth), upcoming events approach from ahead (Guitar Hero).
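The three modes differ mainly in which screen axis carries time and which carries the pitch or quality value. The following sketch makes that mapping explicit for a single upcoming event. It is an illustration only: the function name, the normalised coordinate ranges, the lane-based quality axis, and the `lookahead` window are assumptions and do not describe how any of the named games is implemented.

```python
def event_position(mode, time_until_hit, lane, lane_count, lookahead=4.0):
    """Map an upcoming event to normalised (x, y, depth) coordinates in 0..1.

    time_until_hit: seconds until the player has to trigger the event.
    lane: index on the pitch/quality axis, 0 .. lane_count - 1.
    Screen convention assumed here: x grows rightwards, y grows downwards.
    """
    # 1.0 = event just appeared, 0.0 = event reaches the hit line.
    progress = min(max(time_until_hit / lookahead, 0.0), 1.0)
    lane_pos = (lane + 0.5) / lane_count
    if mode == "reading":
        # Time runs horizontally; events scroll from right to left.
        return (progress, lane_pos, 0.0)
    if mode == "falling":
        # Time runs vertically; events drop from the top to the bottom.
        return (lane_pos, 1.0 - progress, 0.0)
    if mode == "driving":
        # Time runs along the depth axis; events approach from ahead.
        return (lane_pos, 0.0, progress)
    raise ValueError("unknown mode: " + mode)
```

Calling the function once per frame with a decreasing `time_until_hit` yields the scrolling, dropping, or approaching motion of the respective mode.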
The illustrations do not have to be musically accurate. They are often simplified for the sake of better playability. In Guitar Hero, for instance, no exact pitch is represented, only melodic contour. Even this is scaled down to the narrow ambit that the game controller supplies. It is, in fact, not necessary to translate note events into some kind of stylization. Structural entities other than pitch values can be indicated as well. In Amplitude, it is the polyphony of multiple tracks (rhythm, vocals,
bass, for example) arranged as multiple lanes. Color coding is often used to represent sound timbre (Audiosurf). Other visualization techniques are based on the actual waveform of the recording or on its Fourier transform (commonly used in media player plug-ins and also in games). For completeness, it should be mentioned that it is, of course, not enough to create only a static scene or a still shot. Since music is a temporal art its visualisation has to develop over time, too.

In music video games, as well as in video clips, music constitutes the central value of the medium. It is not subject to functional dependencies on the visual layer. Conversely, the visual layer is contingent upon music, as was already described. Although the visual scene typically does not show or even include any sound sources in a traditional sense (like those described in the previous section), music has to be declared a diegetic entity, even more than the visuals. These are only a translation of an assortment of musical aspects into visual metaphors. They illustrate, comment, concretize, and channel associations which the music may evoke (Kungel, 2004). They simplify conventional visually marked interaction techniques. But the interaction takes place in the music domain. The visuals do not and cannot grasp the musical diegesis as a whole.³

In this scenario the diegesis is literally constituted by music. It is the domain of musical possibilities. In this (its own) world, music is subject to no restrictions. The visual layer has to follow. The imaginary world that derives from this is equally subject to no logical or rational restrictions. The routings of the obstacle courses in Audiosurf run freely in a weightless space: even the background graphics and effects have nothing in common with real sky or space depictions. Practical restrictions, such as those discussed above for onstage performed music (like radio reception interference, wrong notes and so forth), likewise do not exist.
Hence, the performative quality can be at the highest stage, that is, studio level.
Interactivity in the Musical Domain

However, the possibilities to explore these worlds interactively are still severely limited. Often, statically predetermined pieces of music dictate the tempo and rhythm of some skill exercises without any response to whether the player does well or badly. This compares to conventional on-rails shooter games that show a pre-rendered video sequence which cannot be affected by the player whose only task is to shoot each appearing target. A particular piece of music is, here, essentially nothing else but one particular tracking shot through a much bigger world.

Music does not have to be so fixed and the player should not be merely required to keep up with it. The player can be involved in its creation: “Music videogames would benefit from an increasing level of player involvement in the music” (Williams, 2006, p.7). The diegesis should not be what a prefabricated piece dictates but should rather be considered as a domain of musical possibilities. The piece that is actually played reflects the reactions of the diegesis to player interaction.

An approach to this begins with playing only those note events (or more generally, musical events) that the player actually hits, not those he was supposed to hit. In Rez, for instance, although it is visually an on-rails shooter, only a basic ostinato pattern (mainly percussion rhythms) is predefined and the bulk of musical activity is triggered by the player. Thus, each run produces a different musical output. Williams (2006) goes so far as to state that “it is a pleasure not just to watch, but also to listen to someone who knows how to play Rez really well, and in this respect Rez comes far closer to realising the potential of a music videogame” (p.7). In Rez, the stream of targets spans the domain of musical possibilities. The player’s freedom may still be restricted to a certain extent but this gives the developers a means to keep some control over the musical dramaturgy.
This marks the upper boundary of what is possible with precomposed
and preproduced material. Further interactivity requires more musical flexibility. To achieve it, two different paths can be taken:

• interaction by musically primitive events
• interaction with high-level structures and design principles.
Primitive events in music are single tones, drum beats, and even formally consistent groups of such primitives that do not constitute a musical figure in themselves (for instance, tone clusters and arpeggios). In some cases even motivic figures occur as primitive events: they are usually relatively short (or fast) and barely variable. The game mechanics provide the interface to trigger them and set event properties such as pitch, loudness, timbre, and cluster density.

Ultimately, this leads to a close proximity to interactive virtual instrument concepts. It can be a virtual replica of a piano, violin, or any instrument that exists in reality. Because of the radically different interaction mode (mouse and keyboard) these usually fall behind their real-world prototypes regarding playability. To overcome this limitation, several controllers were developed that adapt form and handling of real instruments like the guitar controller of Guitar Hero, the Donkey Konga bongos, the turntable controller of DJ Hero (FreeStyleGames, 2009), and not to forget the big palette of MIDI instruments (keyboards, violins, flutes, drum pads and so forth). Roads (1996) gives an overview of such professional musical input devices.

But real instruments do not necessarily have to be adapted. The technical possibilities allow far more interaction metaphors, as is demonstrated by the gesture-based Theremin (1924), the sensor-equipped Drum Pants (Hansen & Jensenius, 2006), and the hand and head tracking-based Tone Wall/Harmonic Field (Stockmann, Berndt, & Röber, 2008). Even in the absence of such specialized controllers, keyboard, mouse, and gamepad allow expressive musical input too. The challenge, therefore, is to find appropriate metaphors like
aiming and shooting targets, painting gestural curves, or nudging objects of different types in a two- or three-dimensional scene.

Although the player triggers each event manually he does not have to be the only one playing. An accompaniment can be running autonomously in the background like that of a pianist who goes along with a singer or a rock band that sets the stage for a guitar solo. Repetitive structures (ostinato, vamp, riff) are often applied for this purpose. Such endlessly looping patterns can be tedious over a longer period. Variation techniques like those explained in the previous section can introduce more diversity. Alternatively, non-repetitive material can be applied. Precomposed music is of limited length, hence, it should be sufficiently long. Generated music, by contrast, is subject to no such restrictions. However, non-repetitive accompaniment comes with a further problem: it lacks musical predictability and thereby hampers a player’s smooth performance. This can be avoided. Repetitive schemes can change after a certain number of iterations (for example, play riff A four times, B eight times, and C four times). The changes can be prepared in such a way that the player is warned. A well-known example is the drum roll crescendo that erupts in a climactic crash. Furthermore, tonally close chord relations can relax strict harmonic repetition without losing the predictability of appropriate pitches.

The player can freely express himself against this background. But should he really be allowed to do anything? If yes, should he also be allowed to perform badly and interfere with the music? In order not to discourage some of the customers, lower difficulty settings can be offered. The freedom of interaction can be restricted to only those possibilities that yield pleasant, satisfactory results. There can be a context sensitive component in the event generation just like a driving aid system that prevents some basic mistakes.
Pitch values can automatically be aligned to the current diatonic scale in order to harmonize. A time delay can be used to fit each event perfectly to the underlying meter and rhythmic structure. Advanced difficulty settings can be like driving without such safety systems. They are most interesting for trained players who want to experiment with a bigger range of possibilities.
Interaction with high-level structures is less direct. The characteristic feature of this approach is the autonomy of the music. It plays back by itself and reacts to user behaviour. While the previously described musical instruments are perceived rather as tool-like objects, in this approach the impression of a musical diegesis, a virtual world filled with entities that dwell there and react to and interact with the player, is much stronger. User interaction affects the arrangement of the musical material or the design principles which define the way the material is generated. In Amplitude (in standard gameplay mode) it is the arrangement. The songs are divided into multiple parallel tracks. A track represents a conceptual aspect of the song like bass, vocals, synth, or percussion, and each track can be activated for a certain period by passing a skill test. Even this test derives from melodic and rhythmic properties of the material to be activated. The goal is to activate them all. The music in Amplitude is precomposed and, thus, relatively invariant. Each run leads ultimately to the same destination music. Other approaches generate the musical material just in time while it is performed. User interaction affects the parameterization of the generation process, which results in different output. For this constellation of autonomous generation and interaction, Chapel (2003) coined the term Active Musical Instrument, an instrument for real-time performance and composition that actively interacts with the user: “The system actively proposes musical material in realtime, while the user’s actions [...] influence this ongoing musical output rather than have the task to initiate each sound” (p. 50).
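The two "driving aids" described earlier for lower difficulty settings, aligning pitches to the current diatonic scale and delaying events onto the metric grid, can be sketched as follows. The sketch rests on assumptions that are not in the text: pitches are MIDI note numbers, the scale is a major scale, and the grid is a fixed subdivision of the beat at constant tempo. All function and parameter names are hypothetical.

```python
import math
from typing import Sequence, Tuple

# Pitch classes of a major (diatonic) scale relative to its tonic.
MAJOR_SCALE = (0, 2, 4, 5, 7, 9, 11)


def snap_pitch(midi_pitch: int, tonic: int = 0, scale: Sequence[int] = MAJOR_SCALE) -> int:
    """Return the nearest pitch that belongs to the scale (ties resolve downward)."""
    for offset in (0, -1, 1, -2, 2):
        candidate = midi_pitch + offset
        if (candidate - tonic) % 12 in scale:
            return candidate
    return midi_pitch  # only reached for scales with gaps wider than two semitones


def delay_to_grid(time_s: float, tempo_bpm: float = 120.0, steps_per_beat: int = 4) -> float:
    """Delay an event to the next grid line of the meter; events are never moved earlier."""
    step = 60.0 / tempo_bpm / steps_per_beat
    # The small epsilon keeps events that already sit on a grid line in place.
    return math.ceil(time_s / step - 1e-9) * step


def assisted_event(midi_pitch: int, time_s: float, assist: bool = True) -> Tuple[int, float]:
    """Apply both aids on low difficulty; pass the raw input through otherwise."""
    if not assist:
        return midi_pitch, time_s
    return snap_pitch(midi_pitch), delay_to_grid(time_s)
```

With the defaults (C major, 120 beats per minute, sixteenth-note grid), a C sharp played slightly late is moved to C and delayed to the next sixteenth. Switching `assist` off corresponds to the advanced setting, where the player's input reaches the music unchanged.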
Chapel states that an Active Instrument can be constructed around any generative algorithm. The first such instrument was developed by Chadabe (1985). While music is created autonomously, the user controls expressive parameters like accentuation, tempo, and timbre. In Chapel’s case the music generation is based on fractal functions which can be edited by the user to create ever new melodic and polyphonic structures. Eldridge (2002) applies self-regulating homeostatic networks. Perturbation of the network causes musical activity, which offers a possible way to interact with the system. The musical toy Electroplankton for Nintendo DS offers several game modes (called plankton types) that build up a musical domain with complex structures, for example, a melodic progression graph (plankton type Luminaria) and a melodic interpreter of graphical curves (plankton type Tracy). These can be freely created and modified by the user. A highly interactive approach that incorporates precomposed material is the Morph Table presented by Brown, Wooller, & Kate (2007). Music consists of several tracks. Each track is represented by a physical cube that can be placed on the tabletop: this activates its playback. For each track, there are two different prototype riffs represented by the horizontal extremes of the tabletop (left and right border). Depending on the relative position of the cube in between, the two riffs are recombined by the music morphing techniques which Wooller & Brown (2005) developed. The vertical positioning of the cube controls other effects. The tabletop interface further allows collaborative interaction with multiple users. This anticipates a promising future perspective for music video games. Music making has always been a collaborative activity that incorporates a social component and encourages community awareness, interaction between musicians, and mutual inspiration. What shall be the role of music games in this context? Do they set the stage for the performers or function as performers themselves? In contrast to conventional media players, which are only capable of playing back prefabricated pieces, music video games will offer a lot more.
They will be a platform for the user to experiment with and on which to realize his ideas. And they will be (indeed, they already are) an easy introduction to music for everyone, even non-musicians, who playfully learn musical principles to good and lasting effect.
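The position-dependent recombination of two prototype riffs described above for the Morph Table can be illustrated with a deliberately simple stand-in. The following Python sketch is not the morphing technique of Wooller & Brown (2005); it merely swaps a growing share of steps from the left riff for the corresponding steps of the right riff as the cube moves from left to right. The step representation, the swap order, and the name `morph` are all assumptions made for illustration.

```python
from typing import List, Optional

Riff = List[Optional[int]]  # one MIDI pitch per step, None for a rest


def morph(left: Riff, right: Riff, x: float) -> Riff:
    """Recombine two equally long riffs for a cube at horizontal position x.

    x = 0.0 returns the left riff and x = 1.0 the right riff; in between,
    the share of steps taken from the right riff grows with x.
    """
    if len(left) != len(right):
        raise ValueError("riffs must have the same number of steps")
    n = len(left)
    x = min(1.0, max(0.0, x))  # clamp positions outside the tabletop
    n_right = round(x * n)
    # Fixed, scattered swap order so that the pattern does not change from one end only.
    order = sorted(range(n), key=lambda i: (i * 5) % n)
    from_right = set(order[:n_right])
    return [right[i] if i in from_right else left[i] for i in range(n)]
```

Because the swap order is fixed, moving the cube back and forth always reproduces the same intermediate riffs, which keeps the result predictable for the player. A probabilistic choice per step would be an equally plausible design if more variety were wanted.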
INTERACTING WITH MUSIC: A CONCLUSION
Music as a diegetic occurrence in interactive media cannot be considered apart from interactivity. But music being the object of interaction is a challenging idea. It is worth taking up this challenge. The growing popularity of music video games over the last few years encourages us to explore the boundaries of interactivity further and to surmount them. Music does not have to be static. It can vary in its expressivity regarding the way it is performed. Users can interact with virtual performers. These do not have to play fixed compositions. Let them ornament their melodies, vary or even improvise on them. Why not just generate new music in real time while the game is played? Let the players exert an influence on this. Or enable them to playfully arrange or create their own music. Few of these possibilities have been applied in practice up to now. Music is a living art that should be more than simply reproduced; it should be experienced anew each time. It is a temporal art and its transience is an inherent component. This chapter has shown how to raise music in interactive media above the status of its mere reproduction. As a domain of interactivity, it invites the users to explore, to create, and to have new musical experiences.
REFERENCES
AM3D (2009). AM3D [Computer software]. AM3D A/S (Developer). Aalborg, Denmark.
Aav, S. (2005). Adaptive music system for DirectSound. Unpublished master’s thesis. University of Linköping, Sweden.
Adler, S. (2002). The study of orchestration (3rd ed.). New York: Norton & Company.
Adorno, T. W., & Eisler, H. (1947). Composing for the films. New York: Oxford University Press.
Arons, B. (1992, July). A review of the cocktail party effect. Journal of the American Voice I/O Society, 12, 35-50.
Berndt, A. (2008). Liturgie für Bläser (2nd ed.). Halberstadt, Germany: Musikverlag Bruno Uetz.
Berndt, A. (2009). Musical nonlinearity in interactive narrative environments. In G. Scavone, V. Verfaille & A. da Silva (Eds.), Proceedings of the Int. Computer Music Conf. (ICMC) (pp. 355-358). Montreal, Canada: International Computer Music Association, McGill University.
Berndt, A., & Hähnel, T. (2009). Expressive musical timing. In Proceedings of Audio Mostly 2009: 4th Conference on Interaction with Sound (pp. 9-16). Glasgow, Scotland: Glasgow Caledonian University, Interactive Institute/Sonic Studio Piteå.
Berndt, A., Hartmann, K., Röber, N., & Masuch, M. (2006). Composition and arrangement techniques for music in interactive immersive environments. In Proceedings of Audio Mostly 2006: A Conference on Sound in Games (pp. 53-59). Piteå, Sweden: Interactive Institute/Sonic Studio Piteå.
Berndt, A., & Theisel, H. (2008). Adaptive musical expression from automatic real-time orchestration and performance. In Spierling, U., & Szilas, N. (Eds.), Interactive Digital Storytelling (ICIDS) 2008 (pp. 132–143). Erfurt, Germany: Springer. doi:10.1007/978-3-540-89454-4_20
Brown, A. R., Wooller, R. W., & Kate, T. (2007). The morphing table: A collaborative interface for musical interaction. In A. Riddel & A. Thorogood (Eds.), Proceedings of the Australasian Computer Music Conference (pp. 34-39). Canberra, Australia.
Chadabe, J. (1985). Interactive music composition and performance system. U.S. Patent No. 4,526,078. Washington, DC: U.S. Patent and Trademark Office.
Chapel, R. H. (2003). Real-time algorithmic music systems from fractals and chaotic functions: Towards an active musical instrument. Unpublished doctoral dissertation. University Pompeu Fabra, Barcelona, Spain.
Collins, K., Tessler, H., Harrigan, K., Dixon, M. J., & Fugelsang, J. (2011). Sound in electronic gambling machines: A review of the literature and its relevance to game audio. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Ebcioglu, K. (1992). An expert system for harmonizing chorales in the style of J. S. Bach. In Balaban, M., Ebcioglu, K., & Laske, O. (Eds.), Understanding music with AI: Perspectives on music cognition (pp. 294–334). Cambridge, MA: MIT Press.
Ekman, I. (2009). Modelling the emotional listener: Making psychological processes audible. In Proceedings of Audio Mostly 2009: 4th Conference on Interaction with Sound (pp. 33-40). Glasgow, Scotland: Glasgow Caledonian University, Interactive Institute/Sonic Studio Piteå.
Eldridge, A. C. (2002). Adaptive systems music: Musical structures from algorithmic process. In C. Soddu (Ed.), Proceedings of the 6th Generative Art Conference. Milan, Italy: Politecnico di Milano University.
Fastl, H., & Zwicker, E. (2007). Psychoacoustics: Facts and models (3rd ed., Vol. 22). Berlin, Heidelberg: Springer.
Firelight (2009). FMOD Ex. v4.28 [Computer software]. Victoria, Australia: Firelight Technologies.
Fitterer, D. (2008). Audiosurf: Ride Your Music [Computer game]. Washington, DC: Valve.
Flossmann, S., Grachten, M., & Widmer, G. (2009). Expressive performance rendering: Introducing performance context. In Proceedings of the 6th Sound and Music Computing Conference (SMC). Porto, Portugal: Universidade do Porto.
FreeStyleGames (2009). DJ Hero [Computer game]. FreeStyleGames (Developer), Activision.
Friberg, A., Bresin, R., & Sundberg, J. (2006). Overview of the KTH Rule System for musical performance. Advances in Cognitive Psychology. Special Issue on Music Performance, 2(2/3), 145–161.
Galloway, A. R. (2006). Gaming: Essays on algorithmic culture. Electronic Mediations (Vol. 18). Minneapolis: University of Minnesota Press.
Hansen, S. H., & Jensenius, A. R. (2006). The Drum Pants. In Proceedings of Audio Mostly 2006: A Conference on Sound in Games (pp. 60-63). Piteå, Sweden: Interactive Institute/Sonic Studio.
Harmonix (2003). Amplitude [Computer game]. Harmonix (Developer), Sony.
Harmonix (2006-2009). Guitar Hero series [Computer games]. Harmonix, Neversoft, Vicarious Visions, Budcat Creations, RedOctane (Developers), Activision.
Herber, N. (2006). The Composition-Instrument: Musical emergence and interaction. In Proceedings of Audio Mostly 2006: A Conference on Sound in Games (pp. 53-59). Piteå, Sweden: Interactive Institute/Sonic Studio Piteå.
Hiller, L. A., & Isaacson, L. M. (1959). Experimental music: Composing with an electronic computer. New York: McGraw Hill.
Hitchcock, A. (1956). The Man Who Knew Too Much [Motion picture]. Hollywood, CA: Paramount.
Hörnel, D. (2000). Lernen musikalischer Strukturen und Stile mit neuronalen Netzen. Karlsruhe, Germany: Shaker.
Hörnel, D., & Menzel, W. (1999). Learning musical structure and style with neural networks. Computer Music Journal, 22(4), 44–62. doi:10.2307/3680893
Iwai, T. (2005). Electroplankton [Computer game]. Indies Zero (Developer), Nintendo.
Jørgensen, K. (2011). Time for new terminology? Diegetic and non-diegetic sounds in computer games revisited. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Kirnberger, J. P. (1767). Der allezeit fertige Polonaisen und Menuetten Komponist. Berlin, Germany: G.L. Winter.
Klinger, R., & Rudolph, G. (2006). Evolutionary composition of music with learned melody evaluation. In N. Mastorakis & A. Cecchi (Eds.), Proceedings of the 5th WSEAS International Conference on Computational Intelligence, Man-Machine Systems and Cybernetics (pp. 234-239). Venice, Italy: World Scientific and Engineering Academy and Society.
Konami (1998). Dance Dance Revolution [Computer game]. Konami, Disney, Keen, Nintendo.
Kungel, R. (2004). Filmmusik für Filmemacher: Die richtige Musik zum besseren Film. Reil, Germany: Mediabook-Verlag.
Lissa, Z. (1965). Ästhetik der Filmmusik. Leipzig, Germany: Henschel.
Livingstone, S. R. (2008). Changing musical emotion through score and performance with a compositional rule system. Unpublished doctoral dissertation. The University of Queensland, Brisbane, Australia.
Loki & Creative (2009). OpenAL 1.1 [Computer software]. Loki Software, Creative Technology.
Löthe, M. (2003). Ein wissensbasiertes Verfahren zur Komposition von frühklassischen Menuetten. Unpublished doctoral dissertation. University of Stuttgart, Germany.
LucasArts (1997). Monkey Island 3: The Curse of Monkey Island [Computer game]. LucasArts.
Manz, J., & Winter, J. (Eds.). (1976). Baukastensätze zu Weisen des Evangelischen Kirchengesangbuches. Berlin: Evangelische Verlagsanstalt.
Mazzola, G., Göller, S., & Müller, S. (2002). The topos of music: Geometric logic of concepts, theory, and performance. Zurich: Birkhäuser Verlag.
Meyer, J. (2009). Acoustics and the performance of music: Manual for acousticians, audio engineers, musicians, architects and musical instrument makers (5th ed.). New York: Springer.
Microsoft (2009). DirectX 11 [Computer software]. Microsoft Corporation.
Miranda, E. R., & Biles, J. A. (Eds.). (2007). Evolutionary computer music (1st ed.). USA: Springer. doi:10.1007/978-1-84628-600-1
Mozart, W. A. (1787). Musikalisches Würfelspiel: Anleitung so viel Walzer oder Schleifer mit zwei Würfeln zu componieren ohne musikalisch zu seyn noch von der Composition etwas zu verstehen. Köchel Catalog of Mozart’s Work KV1 Appendix 294d or KV6 516f.
Namco (2003). Donkey Konga [Computer game]. Namco (Developer), Nintendo.
NanaOn-Sha (1996). PaRappa the Rapper [Computer game]. NanaOn-Sha (Developer), Sony.
NanaOn-Sha (1999). Vib-Ribbon [Computer game]. NanaOn-Sha (Developer), Sony.
Pachet, F., & Roy, P. (2001). Musical harmonization with constraints: A survey. Constraints Journal.
Papadopoulos, G., & Wiggins, G. (1999). AI methods for algorithmic composition: A survey, a critical view and future prospects. In AISB Symposium on Musical Creativity. Edinburgh, Scotland.
Pozzati, G. (2009). Infinite suite: Computers and musical form. In G. Scavone, V. Verfaille & A. da Silva (Eds.), Proceedings of the International Computer Music Conference (ICMC) (pp. 319-322). Montreal, Canada: International Computer Music Association, McGill University.
Roads, C. (1996). The computer music tutorial. Cambridge, MA: MIT Press.
Röber, N. (2008). Interacting with sound: Explorations beyond the frontiers of 3D virtual auditory environments. Munich, Germany: Dr. Hut.
Röber, N., Kaminski, U., & Masuch, M. (2007). Ray acoustics using computer graphics technology. In Proceedings of the 10th International Conference on Digital Audio Effects (DAFx-07) (pp. 117-124). Bordeaux, France: LaBRI University Bordeaux.
Schottstaedt, W. (1989). Automatic counterpoint. In Mathews, M., & Pierce, J. (Eds.), Current directions in computer music research. Cambridge, MA: MIT Press.
Sega (2001). Rez [Computer game]. Sega.
Sevsay, E. (2005). Handbuch der Instrumentationspraxis (1st ed.). Kassel, Germany: Bärenreiter.
Shultz, P. (2008). Music theory in music games. In Collins, K. (Ed.), From Pac-Man to pop music: Interactive audio in games and new media (pp. 177–188). Hampshire, UK: Ashgate.
Sierra (1993). Gabriel Knight: Sins of the Fathers [Computer game]. Sierra Entertainment.
Stenzel, M. (2005). Automatische Arrangiertechniken für affektive Sound-Engines von Computerspielen. Unpublished diploma thesis. Otto-von-Guericke University, Department of Simulation and Graphics, Magdeburg, Germany.
Stockmann, L. (2007). Designing an audio API for mobile platforms. Internship report. Magdeburg, Germany: Otto-von-Guericke University.
Stockmann, L., Berndt, A., & Röber, N. (2008). A musical instrument based on interactive sonification techniques. In Proceedings of Audio Mostly 2008: 3rd Conference on Interaction with Sound (pp. 72-79). Piteå, Sweden: Interactive Institute/Sonic Studio Piteå.
Taube, H. K. (2004). Notes from the metalevel: Introduction to algorithmic music composition. London, UK: Taylor & Francis.
Theremin, L. S. (1924). Method of and apparatus for the generation of sounds. U.S. Patent No. 73,529. Washington, DC: U.S. Patent and Trademark Office.
Tobler, H. (2004). CRML: Implementierung eines adaptiven Audiosystems. Unpublished master’s thesis. Fachhochschule Hagenberg, Hagenberg, Austria.
Verbiest, N., Cornelis, C., & Saeys, Y. (2009). Valued constraint satisfaction problems applied to functional harmony. In Proceedings of IFSA World Congress EUSFLAT Conference (pp. 925-930). Lisbon, Portugal: International Fuzzy Systems Association, European Society for Fuzzy Logic and Technology.
Williams, L. (2006). Music videogames: The inception, progression and future of the music videogame. In Proceedings of Audio Mostly 2006: A Conference on Sound in Games (pp. 5-8). Piteå, Sweden: Interactive Institute/Sonic Studio Piteå.
Wingstedt, J. (2008). Making music mean: On functions of, and knowledge about, narrative music in multimedia. Unpublished doctoral dissertation. Luleå University of Technology, Sweden.
Wooller, R. W., & Brown, A. R. (2005). Investigating morphing algorithms for generative music. In Proceedings of Third Iteration: Third International Conference on Generative Systems in the Electronic Arts. Melbourne, Australia.
KEY TERMS AND DEFINITIONS
Diegesis: Traditionally it is a fictional story world. In computer games, or more generally in interactive media, it is the domain the user ultimately interacts with.
Diegetic Music: Music that is performed within the diegesis.
Extra-Diegetic: The terms extra-diegetic and non-diegetic refer to elements outside of the diegesis. Extra-diegetic is commonly used for elements of the next upper layer, the narrator’s world or the game engine, for instance. Non-diegetic, by contrast, refers to all upper layers up to the real world.
Music Video Games: Computer games with a strong focus on music-related interaction metaphors. For playability, musical aspects are often, if not usually, transformed into visual representatives.
Musical Diegesis: In music video games, the user interacts with musical data. These constitute the domain of musical possibilities, the musical diegesis.
Nonlinear Music: The musical progress incorporates interactive and/or non-deterministic influences.
ENDNOTES
1. Although this book prefers the generic term computer games, here, I use the term music video game both to emphasize the musical interaction and because it is the more commonly used term for this genre.
2. Building block music: translated from the German term “Baukastenmusik” (Manz & Winter, 1976).
3. Likewise, non-diegetic film music does not and cannot mediate the complete visual diegesis.
Section 2
Frameworks & Models
Chapter 5
Time for New Terminology? Diegetic and Non-Diegetic Sounds in Computer Games Revisited
Kristine Jørgensen
University of Bergen, Norway
ABSTRACT
This chapter is a critical discussion of the use of the concepts diegetic and non-diegetic in connection with computer game sound. These terms are problematic because they neither take into account the functional aspects of sound nor indicate how gameworlds differ from traditional fictional worlds. The aims of the chapter are to re-evaluate earlier attempts at adapting this terminology to games and to present an alternative model of conceptualizing the spatial properties of game sound with respect to the gameworld.
DOI: 10.4018/978-1-61692-828-5.ch005
Copyright © 2011, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
INTRODUCTION
Two concepts from narrative theory that often appear in discussions about game sound are diegetic and non-diegetic (Collins, 2007, 2008; Ekman, 2005; Grimshaw, 2008; Grimshaw & Schott, 2007; Jørgensen, 2007b, 2008; Stockburger, 2003; Whalen, 2004). The terms are used in film theory to separate elements that can be said to be part of the depicted fictional world from elements that the fictional characters cannot see or hear and which should be considered non-existent in the fictional world (Bordwell, 1986; Bordwell & Thompson, 1997). According to this approach, dialogue between two characters is seen as diegetic, while background score music is seen as non-diegetic. In connection with game sound, a likely adaptation of these concepts would describe the response “More work?” from an orc peon unit in the real-time strategy game Warcraft 3 (Blizzard, 2002) as an example of a diegetic sound since it is spoken by a character within the gameworld. Music that signals approaching enemies in the role-playing game Dragon Age: Origins (Bioware, 2009) would according to this view be an example of non-diegetic sound since the music is not being played from a source within the game universe. However, when analyzing the examples more closely, we see that using these terms in computer games is confusing and at best inaccurate. As a
response to a player command, the “More work?” question has an ambiguous status in relation to the gameworld: If we ask ourselves who the peon is talking to, it appears to address the player, who is not represented as a character in the gameworld, but manages the troops and base from outside the gameworld. The warning music heard in the role-playing game is also ambiguous. Although there is nothing to suggest that the music is being played by an orchestra in the wilderness, there is no doubt that the music influences the players’ tactical decisions and therefore has direct consequences for the player-characters’ actions and the progression of the game. The confusion arises because game sound has a double status in which it provides usability information to the player at the same time as it has been stylized to fit the depicted fictional world. It works as support for gameplay, while also providing a sense of presence in the gameworld (Jørgensen, 2007a, 2009; Nacke & Grimshaw, 2011). From this point of view, diegetic and non-diegetic sounds tend to blend systematically in games, thereby creating additional levels of communication compared to the traditional diegetic versus non-diegetic divide. Although sound may be categorized and discussed in several ways, the diegetic versus non-diegetic divide may be especially attractive for describing modern computer games since they are set in universes that are separate from ours and that, on the surface, resemble the fictional universes of film and literature. This makes the terminology seem like an illustrative approach for describing auditory properties with respect to the represented universe in games. The concepts enable us to separate what is perceived as internal to that universe from what is perceived as external to it.
However, as this chapter will argue, the concepts of diegetic and non-diegetic are developed with traditional media in mind, and are therefore confusing and misleading when attempts are made to uncritically transfer them to computer games. First, the participatory role of the player is not accounted for in this theory, which
means that the functional aspects of game sound disappear when applying diegetic and non-diegetic to game sound. Also, gameworlds cannot be appropriately described by these terms since they are designed for different purposes than traditional fictional worlds. Since gameworlds invite users to enter their domains as players, they are qualitatively different from other fictional worlds, and this makes the traditional diegetic versus non-diegetic divide problematic when applied to computer games. While the aim of the chapter is to evaluate the use of the two concepts in relation to game sound, the chapter will also be a revision of my earlier theory on transdiegetic sounds (Jørgensen, 2008b). I will discuss my own and other attempts at adapting the concepts to game sound, based on the original meaning and uses of diegesis, and present an alternative way of conceptualizing the phenomena in relation to game sound. The main argument of this chapter rests on two principles. One is that the participatory nature of games allows the players a dual position where they are located on the outside of the gameworld but with power to reach into it. The other is that gameworlds differ from traditional fictional worlds in fundamental ways as they are worlds intended for play. This difference requires game sound to be evaluated on terms other than those used for analyzing film sound. A short reader guide is appropriate. The chapter is organized according to principles of clarity where an overview of earlier theory creates the basis of the argument and, in order to get the most out of the chapter, it should be read from beginning to end rather than being dipped into. I will introduce the chapter with a discussion of the origin and application of diegetic and non-diegetic in traditional media before going on to present other attempts at categorizing game sound (Collins, 2007, 2008; Huiberts & van Tol, 2008; Stockburger, 2003; Whalen, 2004).
Next, the chapter will review different attempts to adapt diegetic terminology to games (Galloway, 2006) and game sound (Ekman, 2005; Grimshaw,
2008; Jørgensen, 2007b). I will then discuss how gameworlds differ from traditional fictional worlds and how this has consequences for the way we interact with them (Aarseth, 2005, 2008; Klevjer, 2007), and consequently for the application of diegetic and non-diegetic. The last section of the chapter will present an alternative model for analyzing game sound in terms of spatial integration. Throughout the chapter, I will also use data from research interviews with empirical players where this is appropriate. The data concern player interpretations of so-called transdiegetic features in computer games, and support the idea that gameworlds work on premises other than those of traditional fictional worlds. Although this chapter focuses on the auditory aspect of games in particular, it should be noted that the discussion about the relevance of diegetic and non-diegetic features does not concern auditory features alone. However, sound is particularly interesting for several reasons. Since sound is neither tangible nor visible, and has a temporal quality, it has the ability to remain non-intrusive even when it breaks the borders of the gameworld. The ability to seamlessly integrate with the gameworld gives it the opportunity to challenge the relationship between diegetic and non-diegetic in a way that visual information cannot.
BACKGROUND
Diegetic vs. Non-Diegetic Sound
The term diegetic originally stems from The Republic, where Plato distinguishes between two narrative modes that he calls diegesis and mimesis. Diegesis, or pure narrative, is when the poet “himself is a speaker and does not even attempt to suggest to us that anyone but himself is speaking”; while mimesis, or imitation, is when the poet “delivers a speech as if he were someone else” (Plato in Genette, 1983, p. 162). According to film scholar David Bordwell (1986), the term
diegesis was revived in the 1950s to describe the “recounted story” of a film, and today it has become the accepted terminology for “the fictional world of the story” (p. 16). According to this terminology, diegetic sound is represented as “sound which has a source in the story world”, while non-diegetic sound is “represented as coming from a source outside the story world” (Bordwell & Thompson, 1997, p. 330). Game scholars who use diegetic and non-diegetic when describing game sound tend to take their point of departure from this newer, film-theoretical understanding of diegesis, and extend the meaning of the “fictional world of the story” to the universe of the game. As mentioned, this is confusing since it implies that the gameworld is a storyworld, and is misleading because game sound serves different purposes than film sound. These points will be in focus in the following discussion, which critically evaluates the use of diegetic and non-diegetic in relation to computer game sound. Of course, the debate about the relationship between diegetic and non-diegetic features is not unique to game studies. Film theory, too, recognizes the limited ability of this terminology to describe sound precisely. While David Bordwell and Kristin Thompson (1997) define non-diegetic sound as “represented as coming from a source outside the story world” (p. 330), Edward Branigan separates non-diegetic features into extra-fictional and non-diegetic. He argues that when a piece of background film music accompanies the credits of a film, it should be interpreted as extra-fictional, but when it accompanies a series of shots from a nightclub, and is thus presented as typical of an evening at that location, it should be interpreted as non-diegetic (1992, p. 96). In this view, Branigan claims that non-diegetic sound is related to the diegesis, but does not correspond to the fictional characters’ experience of it (1992, p. 96), while extra-fictional sound exists outside the diegesis and is required to talk about the diegesis as fictional (1992, p. 88). Although not accounting for the participatory nature of games, Branigan’s view
of non-diegetic is more sympathetic towards how, for instance, score music works in games, since there is some kind of bond between the sound and what happens within the diegesis. When discussing film music, Michel Chion also points out that the non-diegetic category is complicated. A central reason, in his view, is that so-called diegetic music, like non-diegetic music, may have a commentary function meant to help the interpretation of what is going on in the film. Chion’s own example is Siodmak’s Abschied, in which the protagonist’s emotional states are punctuated by the music of his pianist neighbor, thereby questioning the non-diegetic state of the music. Because of such ambiguous cases, Chion argues that the reference to diegetic and non-diegetic music is misleading, and uses pit music and screen music instead. While pit music “accompanies the image from a non-diegetic position, outside the time and space of action”, screen music refers to “music arising from a source located directly or indirectly in the space of time” (Chion, 1994, p. 80). From this approach, screen music could also be used to describe the computer game version of leitmotifs (Gorbman, 1997, pp. 3, 26-29), in which music with an apparent non-diegetic source warns the player about dangers. The relationship between diegetic and non-diegetic is not a simple one in literary theory either. One example of this is provided by Gerard Genette, who points out that the diegetic and non-diegetic levels often blend together in the act of narration. He uses the term metalepsis to describe any transition from one diegetic level to another. While the classics used the term to refer to “any intrusion by the non-diegetic narrator or narratee into the diegetic universe” (Genette, 1983, pp. 234-235), Genette extends the term and calls all kinds of narrative transitions of elements between distinct levels of the literary diegesis “narrative metalepsis”.
In literature, these transitions range from simple rhetorical figures, where the narrator addresses the reader, to extremes in which a man is killed by a character in the novel he is reading.
However, being closely connected to the act of narration—how a story is told—metalepsis only serves as a comparative illustration for the transboundary movement that happens in computer games. These methods of categorization show that the relationship between diegetic and non-diegetic sound is not without debate in film theory and literary theory but, while the concepts work as a point of departure and as a common ground for understanding the narrative levels of traditional fiction, they create confusion in connection with computer game sound because of the participatory nature of games and gameworlds (Collins, 2008, p. 180; Jørgensen, 2006, p. 48, 2007b, p. 106). In films and computer games equally, sound cues the media user’s understanding of the environment, direction, spatiality, temporality, objects and events. However, film sound is limited to informing the audience as to how to interpret what is going on in an inaccessible world while game sound provides information relevant for understanding how to interact with the game system and behave in the virtual environment that is the gameworld (Jørgensen, 2008). This means that game sound has a double status in which it provides usability information to the player at the same time as it has been stylized to fit the depicted universe. This may create confusion with respect to the role of the sound since it appears to have been placed in the game from the point of view of creating a sense of presence and physicality to the game universe while it actually works as a support for gameplay. A comparison serves as illustration. When the players of The Elder Scrolls III: Morrowind (Bethesda, 2002) hear the music change when navigating through a forest, they know that an enemy is approaching, and may act accordingly. 
However, since this music has no source in the gameworld, the player character should not be able to hear it, but since the player does hear it and may act upon it, the character also seems to act as if it knows enemies are approaching even though it does not yet see them coming. In this sense, sound
Time for New Terminology?
that appears to be non-diegetic affects diegetic events, thereby disrupting the traditional meaning of diegetic and non-diegetic sound (Jørgensen, 2007b). In Pulp Fiction (Tarantino, 1994), on the other hand, one of the characters is sitting in his car accompanied by what at first appears to be non-diegetic music. Suddenly he starts whistling along with the music. In this case, the audience is not led to believe that the character hears music that is not present; instead, they re-interpret the music not as non-diegetic, but as diegetic music played on the car radio. On the surface, the situations from the game and the film may appear similar, but in terms of how the music affects its context, there is a huge difference between the film music and the game music. In the case of the film music, we revise our interpretation when we realize that the fictional character actually can hear it (Branigan, 1992, p. 88). There is therefore never any ambiguity connected to the origin of the music, and we are never led to believe that the character hears music that is not present in his world. The game music, on the other hand, has a functional value related to the game system: it warns the players about a change in game state, namely that an enemy is aware of their presence and about to attack. In this sense, the role of game music is to enable the player to use its informative value to make progress in the game. In this respect, film music and game music have fundamentally different roles. While film music provides clues about moods, upcoming events, and how to interpret specific scenes, game music works as a user interface that provides usability information that helps players progress in the game. Also, while non-diegetic film music never allows the audience to change the protagonists’ behavior or to save them from certain death, game music can enable the player to guide their avatar away from danger or to make them draw their sword even before the enemy has appeared.
This is, of course, a direct result of the difference between players and audiences, and it emphasizes the fact that the concepts of diegetic and non-diegetic have not been designed to take this difference into account, and are therefore not sufficient for analyzing sound in computer games.
Categorization of Game Sound
There have been different attempts to categorize game sound and, in this section, I will present some of the most fruitful endeavors. Although only a few scholars base their descriptions on whether sounds are diegetic or non-diegetic, many refer to the concepts and may in some cases use them as unambiguous ways to look at sound. This section will provide a short overview of such scholarly attempts before the next section goes on to discuss specific attempts to adapt diegetic and non-diegetic concepts to game sound. Axel Stockburger (2003) was perhaps the first academic who came up with a method of categorization for game sound. He defines a number of “sound objects” according to their use in the game environment, and distinguishes between score sound objects, zone sound objects, interface sound objects, speech sound objects, and a range of different effect sound objects connected variously to the avatar, to objects usable by the avatar, to other game characters, to other entities, and to events. Although Stockburger emphasizes the importance of understanding the functional role of sound, his categories do not cover this. Instead, his model describes sound according to what kind of object it is connected to in the game engine. He also uses diegetic and non-diegetic as matter-of-fact and straightforward concepts and does not discuss how they should be interpreted in terms of game sound. One who does argue that diegetic concepts can be usefully applied to game sound is Zack Whalen. He states that non-diegetic game music has two functions: to “expand the concept of a game’s fictional world or to draw the player forward through the sequence of gameplay” (2004). In other words, it can either support the sense of spatiality and presence in the game environment, or support the player’s progression through the
game. His approach is interesting as it takes into account the fact that game music provides information relevant for gameplay, but by being tied to the traditional meaning of non-diegetic it is as misleading as other adaptations of the concepts. A scholar who does see the diegetic/non-diegetic division as complicated is Karen Collins (2007, 2008). She points out that the division between diegetic and non-diegetic sound is problematic since the player is engaging in the on-screen sound playback process directly (2008, p. 125). Her separation between interactive and adaptive sound is based on functionality. Whereas interactive sound refers to sound events occurring in response to player action, adaptive sound reacts to events in the environment (2007, 2008, p. 4). In this respect, sound is understood as a dynamic feature closely related to events, at the same time as it takes into account the agency of the player. Huiberts & Van Tol (2008) also point out that using diegetic and non-diegetic is complicated in connection with game sound, since interactivity allows non-diegetic sounds to affect diegetic events. They still decide to use the terms because they see them as established within game studies. By putting diegetic and non-diegetic in context with setting and activity, their IEZA framework takes into account the interactive aspects of game sound, but does not take into consideration that gameworlds are designed for different purposes compared to diegeses, and that they therefore influence sound in a different way. There are also other models for describing sound in this anthology. Wilhelmsson & Wallén’s (2011) general framework for sound design and analysis combines theories of listening with both the IEZA framework and Murch’s description of five layers between “encoded” and “embodied” sound in film, ranging from speech to music via effect sounds. However, like many others, they take the fruitfulness of diegetic and non-diegetic for granted.
In his discussion of diegetic music, Berndt (2011) claims that what he calls visualized music must be considered diegetic. This is the
visualization of structural features of a musical composition, exemplified by the stylized visualization of patterns found in the user interface of music games such as Rock Band (Harmonix, 2007) and Electroplankton (Indies Zero, 2006). From the point of departure of this chapter, this view of diegetic is problematic, since it distances itself from the original use of diegesis and thereby creates confusion. Milena Droumeva, on the other hand, outlines a framework of game sound according to “realism” in terms of fidelity and verisimilitude, and connects these to acoustic ecology and Barry Truax’ idea of an acoustic community that includes physical world sounds that have an impact upon gameplay. Examples of this are the acoustic soundscape of group play, and online conferencing (“live chat”) (Droumeva, 2011). From this perspective, she argues that the use of diegetic and non-diegetic terminology is limited because it fails to acknowledge the importance of these kinds of sounds. Although a valid point when discussing the general soundscape of the gaming activity, this point has only limited value to the argument of this chapter, since it is restricted to how game internal sound works with respect to the gameworld, and only briefly mentions externally produced sounds.
Diegetic Theories of Game Sound
Some of the more critical attempts at adapting diegetic and non-diegetic to games have resulted in analyses that show that game sound has more significant layers of meaning than can be explained by using the terminology above. In this section, I will evaluate the most comprehensive of these adaptations and discuss their strengths and weaknesses. However, even though the following accounts are attentive to how the concepts of diegetic and non-diegetic when used for describing games differ from how they are used for films, emphasizing this difference may lead to a situation in which one keeps leaning too heavily on a terminology that is meant to describe film sound, without being able to free oneself to establish a new model designed to take the particular characteristics of game sound into account. A game scholar that partly succeeds in using diegetic and non-diegetic in his description of games is Alexander Galloway (2006). Focusing on games as activities, he couples the terminology with his own terminology of whether it is the player (operator) or game system (machine) that performs the act. His model describes all actions as executed either inside the “world of gameplay” or outside of it and whether it is the player or the game system that takes a specific action. In this way, he describes all actions from the player firing a gun to configuring the options menu, from the movements of non-playing characters to the spawning of power-ups. While the categories themselves are not crucial to this chapter, Galloway’s perspective is important. He emphasizes the fact that games are activities and that they must be described as such. He also states that when diegetic and non-diegetic are used in connection with games the meaning of the terms changes (Galloway, 2006). However, even though he points this important fact out, Galloway’s use of these terms is somewhat confusing since he, like I do with the term transdiegetic, tries to change the concepts from describing the relative positioning of features in space to describing actions. The model is worth mentioning, however, since the action-oriented perspective supports sound by focusing on temporality: that is, like sound, action is time-based. Galloway’s approach to diegesis as a “world of gameplay” is also closely related to Mark Grimshaw’s radical modification of what should count as diegetic sound in computer games.
He extends the idea of diegetic sound compared to film theory, and states that in computer games, diegetic sound is “defined as the sound that emanates from the gameplay environment, objects and characters and that is defined by that environment, those objects and characters”, and that it must “derive from some entity of the game during play” (Grimshaw,
2008, p. 224). In this respect, sounds do not have to be placed within the game environment in a way that we recognize from the physical world. In other words, as long as the referent is diegetic, the signal does not need to be. There is no need to have a character in the gameworld that produces the sound for it to count as diegetic. For Grimshaw, sounds are diegetic as long as they relate to actions and events in the gameworld. He exemplifies by pointing out that sounds signaling the entrance or exit of players in a multiplayer game should be considered diegetic since they concern entities in the game environment and affect their behavior. Based on this understanding, Grimshaw elaborates that diegetic game sounds are not limited to sounds that exist in the gameworld but that we also need to take into account all sounds that provide information relevant for understanding the gameworld. In effect, this would also include the traditional background music that signals an enemy about to attack in The Elder Scrolls III: Morrowind, and disembodied voiceovers in Warcraft 3. By introducing additional new concepts that specify whether a sound is heard by a specific player (ideodiegetic sounds), and whether such a sound results from the player’s haptic input or not (kinediegetic versus exodiegetic sounds) (Grimshaw & Schott, 2007; Grimshaw, 2008), Grimshaw creates a game-specific terminology that recognizes its theoretical relationship to the diegetic or non-diegetic divide. A concept that is particularly interesting is what he calls telediegetic sounds. Connected to multiplayer situations, these are sounds produced by one player and of consequence for a second player who does not hear that sound. While it may be seen as a paradox to call this information auditory when it is in fact the action of the first player that affects the second player, the concept has interesting implications. 
If we detach the concept from the idea that it must be heard by a first player, it may be extended to all situations in which players appear to react to a sound that they do not hear, such as is the case when players apparently react to the traditionally
speaking non-diegetic music of approaching enemies. However, even though Grimshaw’s theory emphasizes all sounds that have relevance for player actions in the gameworld, it is confusing that he still insists on using the concept diegetic also for sounds that appear to have no source in the game environment and that the avatar should not be able to hear. In any respect, Grimshaw’s extension of what counts as diegetic, and his focus on the player in relation to the concept, are strong arguments for exchanging the existing terminology with new. In my Ph.D. research (Jørgensen, 2007a, 2009), I developed a model of categorization that took into consideration functionality with respect to usability and type of information, location with respect to the gameworld, and referentiality with respect to the relationship between sound signal and the event it refers to (2007a, pp. 84-87). In Jørgensen (2008), the model was further developed to include what generates a specific sound. However, in describing the location of sound with respect to the gameworld, these models both included references to the diegetic/non-diegetic divide by the use of the neologism transdiegetic sounds (Jørgensen, 2007b). This approach described sound as transdiegetic by way of transcending the border between diegetic and non-diegetic: Diegetic sounds may address non-diegetic entities, while non-diegetic sounds may communicate to entities within the diegetic world. Such sounds have an important functional value in computer games by being an extension of the user interface and providing information such as feedback and warnings to the player. Utilizing the border between diegetic and non-diegetic, transdiegetic sounds merge game system information with the gameworld and create a frame of reference that has usability value at the same time as it upholds the sense of presence in the gameworld. 
Using this terminology, I argued that apparently non-diegetic music that provides information relevant for player action in the gameworld is external transdiegetic since the musical source is not found within the
gameworld but is external to it. The same goes for the disembodied warning “Our base is under attack!” in Warcraft 3. It is external transdiegetic because it provides information relevant to player action, but is not produced by anyone within the gameworld. When the avatar in Diablo 2 (Blizzard, 1998) claims “I’m overburdened”, however, I called the sound internal transdiegetic because the avatar as a character existing in the gameworld communicates to the player situated in an external position. The strengths of transdiegetic as concept are that it emphasizes the functional role of the sound in relation to player action in the gameworld, and it points out that the spatial origin of the sound is often relative. It is also able to describe all game sounds by using the same framework. However, it is confusing that it is based on the term diegesis, which creates connotations to the mechanisms of narratology and storytelling. Also, the internal and external variations are flawed as they appear to be two variations over the same theme, while in reality they are not. While internal transdiegetic sounds can easily be interpreted as abstractions of “diegetic” sounds since they are partly integrated into the game environment, external transdiegetic sounds are externally situated but with clear impact on the game environment. Inger Ekman’s approach to game sound (2005) is closely related to that of transdiegetic sounds. Common to Ekman’s and my account is the idea that the space of the gameworld is not absolute, and that information is carried across its boundaries. Another common ground is the idea that game sounds are used to integrate the game system into the environment in which it is set. From a semiotic perspective, she observes that game sounds that traditionally would be labeled diegetic, often have non-diegetic referents, and vice versa. 
In this respect, computer game sound is not limited to being diegetic or non-diegetic, but creates two additional layers that may be used to integrate non-diegetic elements connected to the game system into the diegetic world of the game. Masking sounds is her term for diegetic sound signals with non-diegetic
referents. Such sounds appear to be produced in the gameworld, while their referent is a mechanic of the game system. An example of a masking sound can be found in World of Warcraft when a monster attacks the avatar preemptively. In such cases, a sound specific for that monster will be heard that signals to the player that the avatar has entered the aggression zone of that monster. This sound is hard to interpret as natural to the world of the game since no animal would signal to its prey that it is about to attack. Being represented by a sound signal with a source in the gameworld, the sound has the ability to mask its origin as a system message by being integrated into the gameworld, and thus becomes situated on the border of what is traditionally seen as the diegesis. Ekman calls a sound symbolic, however, in cases where the signal is non-diegetic and the referent is diegetic. An example of this is adaptive game music that is not produced by a source in the gameworld, but refers to an event in the gameworld, such as is the case when the player suddenly hears the music change when an enemy is about to attack in Dragon Age: Origins. Although Ekman’s model is fruitful in explaining how game sound relates to the traditional film theory understanding of diegetic and non-diegetic sound, it also demonstrates the problematic aspects of applying these concepts to games because game sound in many cases is only partially diegetic. Also, there are many examples of sounds that cannot be fully explained by Ekman’s model. When a voice that apparently belongs to the avatar proclaims that “I’m overburdened” in Diablo II, it is not certain whether signal and referent are diegetic or not.
While the signal gives the impression of being diegetic due to the use of the first person personal pronoun and the fact that it is produced by a voice that seems to belong to the avatar, it may also be interpreted as a non-diegetic system sound masked as diegetic since it is unclear who the avatar is talking to (itself or the player?) and since it provides information about the inventory, which is the game system feature that allows the
player to collect and store items in the game. This interpretation was suggested by two player respondents in my research on the topic of transdiegetic communication: […] Well, it is the character’s voice saying this. But still I don’t get the feeling that it is the character speaking. It’s like the game narrator’s voice provides the player with a hint that, okay, you should check your inventory. […] (John, (30). Individual interview, Dec 10, 2008.)1 It’s a like some sort of error, or a… if you want to see her as an individual person, it’s really an error. Because then the question is, who is she talking to? […] (Isabel (25). Individual interview, Dec 1, 2008.) While John sees the above sound signal as a system message masked as diegetic, Isabel thinks of it as an error since it is unclear who the avatar is talking to. In this case, the referent is ambiguous as well, since it is not clear whether the sound refers to the fact that the avatar is trying to pick up something in the gameworld but fails or to the fact that the inventory is overloaded. Warcraft 3 provides another example. When the player tries to place a new building on an illegal location, a disembodied voiceover says, “Can’t build there!” At first glance, the signal seems to be non-diegetic since there is no character in the gameworld that produces the sound. However, this is challenged by the fact that the voice and the accent are very similar to the voices of the other units of that race. The referent is even more ambiguous: while the sound refers to an operation that is illegal according to the game system, it also refers to the fact that this specific location in the gameworld has diegetic properties such as trees or existing structures that make it impossible to build there. As has been demonstrated in the above discussion, the attempts to adapt the concepts diegetic and non-diegetic to game sound point to interesting
aspects that recognize the specificities of game sound compared to sound in other media. At the same time, however, these attempts also demonstrate that the use of concepts designed to explain traditional media is problematic and confusing. There is a need to invent a terminological apparatus that fully grasps the uniqueness of game sound without trivializing it or confusing it with related, but different, features in other media. However, what the adaptations above have in common, is seeing game sound as qualitatively different from sound in other audio-visual contexts. Specifically, there is a tendency to pay attention to the interactive nature of game sound and to see it as a part of the user interface of the game in that it provides information to the player that helps feedback and control (Saunders & Novak, 2006). These adaptations also suggest that gameworlds operate in a different manner compared to storyworlds. This is particularly evident in Grimshaw’s extended understanding of diegetic sound as all sounds that derive from a gameplay event. In the following I will discuss how the understanding of game sound as interface, and the gameworld as a different construct to traditional diegeses, affects the idea of diegetic sound and I suggest alternative ways of discussing the relationship between the gameworld and game sound.
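As a compact summary of one of the adaptations reviewed in this section, Ekman's distinction between the sound signal and its referent can be laid out as a two-by-two classification. The following Python fragment is only an illustrative sketch: the names `Layer`, `Category` and `classify` are hypothetical and appear in none of the cited works, and the `None` result is an assumption added here to mark the ambiguous cases discussed above, where signal or referent cannot be placed with certainty.

```python
from enum import Enum
from typing import Optional


class Layer(Enum):
    """Where a sound signal or its referent sits relative to the diegesis."""
    DIEGETIC = "diegetic"
    NON_DIEGETIC = "non-diegetic"


class Category(Enum):
    """The four combinations implied by the signal/referent distinction."""
    DIEGETIC = "diegetic"          # signal and referent both diegetic
    MASKING = "masking"            # diegetic signal, non-diegetic referent
    SYMBOLIC = "symbolic"          # non-diegetic signal, diegetic referent
    NON_DIEGETIC = "non-diegetic"  # signal and referent both non-diegetic


def classify(signal: Optional[Layer], referent: Optional[Layer]) -> Optional[Category]:
    """Return the category, or None when either side cannot be decided.

    The None case is an assumption of this sketch, not part of Ekman's model.
    """
    if signal is None or referent is None:
        return None
    if signal is Layer.DIEGETIC:
        return Category.DIEGETIC if referent is Layer.DIEGETIC else Category.MASKING
    return Category.SYMBOLIC if referent is Layer.DIEGETIC else Category.NON_DIEGETIC


# Examples taken from the discussion above.
aggro_sound = classify(Layer.DIEGETIC, Layer.NON_DIEGETIC)     # monster aggression sound
warning_music = classify(Layer.NON_DIEGETIC, Layer.DIEGETIC)   # adaptive enemy music
overburdened = classify(None, None)                            # "I'm overburdened"
```

The undecidable third example is the point of the critique: a model with only these four cells has no place for sounds whose signal and referent are both ambiguous.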
SOUND AND THE GAMEWORLD
I have suggested above that diegetic and non-diegetic are problematic in connection with games and game sound because gameworlds are different constructs compared to traditional fictional worlds, or diegeses, and because of the way the players interact with them. In this section I will go into the characteristics of gameworlds, what makes them different from traditional fictional worlds, and what consequences this has for understanding their sound usage. Rune Klevjer rejects using the term diegesis to describe gameworlds due to its link to storytelling,
and argues that gameworlds are radically different from storyworlds because they are worlds designed for playing games. This means they are unified and self-contained wholes, structured as arenas for participation and contest, and are therefore subject to a coherent purpose (Klevjer, 2007, p. 58). Such worlds are created around a different logic than “fictional storyworlds” and, as long as all elements are explained as being parts of the game system, they do not need to be explained as a credible part of a hypothetical world. Espen Aarseth (2008) makes a clear distinction between gameworlds and fictional worlds by stating that the virtual world of World of Warcraft (Blizzard, 2004) is no fictional world but instead “a functional and playable gameworld, built for ease of navigation” (p.118). This is also emphasized in Aarseth (2005) in which he describes the environmental design of Half-Life 2 (Valve, 2004). It is a carefully designed environment with a specific layout that guides the players through specific areas, and limits the freedom of navigation in order to set up the challenges of the game, at the same time as it is given properties that remind one of the physical world in terms of world-representation. I want to follow up on Klevjer’s and Aarseth’s approaches and further point out that gameworlds are universes designed for the purpose of playing games. This means that they are fitted for very specific uses, and their layouts are decided in terms of functionality according to the game system. Environmental features and dungeon layouts are not created randomly but, because of careful design, they are oriented towards a specific gameplay experience. This view will be the starting point for the following discussion that will focus on the functional aspects of gameworlds and sounds connected to it. 
As we will see, this view of the gameworld is important for understanding how sound is used, and explains why players do not see what I earlier called transdiegetic sounds as interfering. As different constructs compared to traditional fictional worlds, gameworlds operate on other
premises. One characteristic of gameworlds is that they need to have a comprehensive system for player interaction. They need to be able to communicate necessary information about changes in game state and allow the player the necessary degree of control. Many of these interface features, including sounds, are often added to the game as abstractions of specific game mechanics partly integrated into the gameworld and, as such, it is problematic to see them as either diegetic or non-diegetic in traditional terms. Instead of looking at what would be a credible representation of a naturalistic world, we should look at how the gameworld and the game system work to support each other. If the game rules state that monsters growl when attacking, and that individuals respawn with their armour 10% damaged after being killed, this is the premise of the specific gameworld. This is a view that is a familiar one for empirical players. One of the player respondents in my empirical research states it thus: […] In this world, you can define whatever you would like there to be, it doesn’t seem that things are very credible in themselves. Q: So why do we accept it? Because it’s a game. And that is something completely different from a film. (Isabel (25). Individual interview, Dec 01, 2008) Here Isabel emphasizes the idea that gameworlds do not need to be a credible alternative to other fictional worlds, and that game designers can decide what they want to include as existent in their world: Because they are integrated with the game system, gameworlds are necessarily different from fictional worlds, such as those of films. This interpretation supports Grimshaw’s extended view of what counts as diegetic in computer games, but at the same time it amplifies the problematic aspects of using diegesis as explanatory terminology,
since gameworlds functionally are very different from literary or cinematic diegeses. Based on the above, the upholding of the game system by the gameworld also has consequences for the integration and design of sound in games. All game sounds have a function with respect to the gameworld, be it to provide information relevant for gameplay or to provide a specific atmosphere. Specific games and genres use sound in different ways and the degree to which it is incorporated into the gameworld plays an important role for reasons of clarity and consistency and in order to create an immediately understandable relationship between the sound and the gameworld. When designing user interfaces for games, a designer needs to decide how to present information to the player. Central to this is deciding which menus should allow interaction, how and whether the user interface should be integrated into the gameworld, and how sounds and visual elements should work together. Game designers Kevin Saunders and Jeannie Novak (2006) describe two ways of relating the user interface to the gameworld and the gamespace. A dynamic interface supports the idea that all audio-visual aspects of a game should be seen as interface because they all provide the player with some kind of information, and dynamic interfaces are therefore completely incorporated into the gameworld. An example is the way an avatar’s armour and weapons provide information in a massively multiplayer online game (MMO)2 like World of Warcraft: By looking at what gear the opponent has, a player receives vital information about class, level and power of that avatar. A static interface, on the other hand, is an overlay interface that consists of external control elements such as health bar, map, pop-up menus, inventory, action bars and so on.
Since user interface and gameworld often tend to merge, making the boundary between gameworld and interface relative (Jørgensen, 2007b, 2008, 2009), the static/dynamic divide should not be seen as absolute, but as a continuum where the interface may be more or less integrated
into the gameworld. Used as an interface, sound often takes on a relativistic position where it is integrated into the gameworld while remaining part of the game system. Using sound signals that are based on real world sounds, but which have been stylized, user interface designers add sounds that provide the necessary usability information at the same time as ensuring the sounds seem natural to the environment of the game. Ekman’s masking sounds are textbook examples of this. Another example is the response “More work?” by Warcraft 3’s orc peons. As a verbal statement produced by a character in the gameworld, it has a direct link to that gameworld, but at the same time it is an interface sound produced in response to player action. However, the sound is not an actual sound of an event in the gameworld, since it would make little sense if the peon actually were talking to the player.
Gameworld vs. Gamespace
So far we have seen that game system information and game user interface features such as sound may be more or less integrated into the gameworld. However, they will also have a specific relationship to the gamespace of a specific game. Looking at this relationship may provide us with clear insights into how gameworlds work compared to diegeses. Gamespace should be understood as the conceptual space in which the game is played (Juul, 2005, p. 167), independent of any possible fictional universe used as a context for it. It is thus the arena on which gameplay takes place, and includes all elements relevant for playing the game. According to the magic circle theory (Huizinga, 1955, p. 10; Salen & Zimmerman, 2004, pp. 94-95) all games are seen as a subset of the real world, delimited by a conceptual boundary that defines what should and should not be understood as part of the game. The magic circle is what separates the game from the rest of the world, and thus defines the gamespace (Juul, 2005, pp. 164-167). One may go as far as claiming that all elements affecting gameplay should be counted within the gamespace, regardless of whether these are part of the original system or design. From this point of view, gamespace seems to be equivalent to Grimshaw’s and Berndt’s understanding of diegesis, since it includes external system features relevant for gameplay, such as voiceovers announcing new players entering the game. Gamespace is therefore also what Droumeva (2011) seems to have in mind when focusing on the importance of live chat and talk that happens during group play. The gamespace is thus distinguished from the gameworld by including all features that have direct relevance to progress in the gameworld, be it score music signaling approaching enemies or add-on software in World of Warcraft, while the gameworld is the contained universe or environment designed for play in which actions and events take place. In this sense, a static overlay interface of a computer game is part of the gamespace, even though it may not be part of the gameworld, while a dynamic integrated interface would be part of the gameworld. For clarification, take the screenshot from Diablo II in Figure 1 as an illustration. The right half of the screen consisting of inventory, the bottom action bar including health and mana measurements, and the upper left icon of the avatar’s minion are all parts of the overlaid interface. These should not be interpreted as part of the gameworld, which is represented by the virtual environment on the left. The interface features are, however, directly relevant for player progress in the gameworld, and they are also attributes governed by the game system. They must therefore be seen as part of the gamespace; that is, the space of action relevant for the game progression included within the magic circle of the game. Now consider the left side of the screenshot, a screen segment of the gameworld.
One interesting feature in this part of the image is the small illuminated icon above the avatar’s head which represents a boost to the avatar’s stamina. In terms of transdiegeticity, I would have explained this feature as internal transdiegetic because, in a traditional sense, it is a feature that seems alien to the diegesis while at the same time it provides information about the gameworld. However, if we view gameworlds as different constructs from traditional fictional worlds, the icon is clearly part of the gameworld, since it is not part of the overlay interface, but a feature picked up as the avatar visited a stamina well and which follows the avatar everywhere he walks. Since gameworlds work on other premises than traditional diegeses, players would have no problem accepting that this is part of the gameworld even though the avatar is not aware of it.

Figure 1. Gamespace vs. gameworld. Diablo 2. ©2000 Blizzard Entertainment, Inc. All rights reserved.

There is an important direct link between the gamespace and the gameworld which is particularly accentuated by the use of sound. When the player decides to discard an item in the screenshot above, he will use his mouse to drag and drop the item from the inventory on the right to the virtual environment on the left or, in other words, he will move it from the gamespace to the gameworld. The moment he selects the item in the
inventory, there will be a short, nondescript click which does not seem to represent any actual sound in the gameworld. However, once he discards it in the gameworld, there is a responsive sound resembling that item being dropped to the ground. If it is a potion, there is a bubbling sound and, if it is a weapon, there is the sound of metal hitting the ground. By being adjusted to the atmosphere of the different spaces, the sound clearly emphasizes which frame it belongs to; there is no doubt, though, that the item does move from one to the other. However, how this movement from frame to frame is achieved may vary between games and genres. A first-person shooter like Crysis (Crytek, 2007) that integrates the interface as a HUD³ that is part of the avatar’s suit situates the relationship between gameworld and gamespace somewhat differently from third-person, avatar-based games. One of the empirical player respondents elaborates:

I’m absolutely positive to the idea [that the avatar sees the HUD]. It’s presented so that the suit he’s wearing […] in a way provides all the information that you need, through the perspective. And, well, it’s one solution, they probably try to make it an integrated part of this world. (Eric (26), individual interview, Nov 28, 2008)

Here, even the HUD and overlaid features must be interpreted as part of the gameworld and thus the gameworld and the gamespace overlap each other more or less completely. The reason for this is that the game user interface designers have decided to make the interface part of the avatar’s advanced military suit so that all audio-visual information is provided to the avatar in the same manner as it is provided to the player. While all features are part of the gamespace as long as they are not connected to external menus in which one changes the game settings or starts a new game, they may or may not be connected to the gameworld as well.⁴ If they are, they are typically positioned in the gameworld in the same way as what I earlier called internal transdiegetic features. While not appearing to be native to the gameworld, they are still positioned inside it graphically. They may be placed above the heads of non-playing characters in a way that allows the player to move around them: they will move with the environment, and not with the overlay interface that is tied to the edges of the screen. An example of a corresponding auditory feature is the “Hi, you’re a tall one!” response from a non-playing character (NPC) in World of Warcraft. Features I earlier called external transdiegetic, however, are not part of the gameworld, only of the gamespace. They are not integrated into the gameworld but provide information relevant for gameplay. An auditory example of this is music signaling the presence of enemies in The Elder Scrolls III: Morrowind and Dragon Age: Origins. 
In this section I have argued that sounds have a particular role in connecting the gamespace and the gameworld, making the boundary between the two more seamless by using interfaces that are integrated into the gameworld in different ways. Since sound is neither tangible nor visible and has a temporary quality, it does not disrupt the sense of a unified space in the same way as alien graphical features would. It therefore seems to be easier to accept the growl of an attacking animal than it is to accept a question mark floating around in thin air. This gives designers greater scope for manipulating auditory information than visual information when creating user interfaces for games. The fact that gameworlds work on other premises than traditional fictional worlds is what makes the player accept stylistic and abstract sounds that integrate the game system into the gameworld, but this ability is also part of the reason why gameworlds are accepted as different constructs from traditional fictional worlds. This discussion also puts emphasis on the argument that talking about diegesis, and thus diegetic and non-diegetic sound, has crucial shortcomings that are avoided if we instead evaluate gamespaces on their own terms by emphasizing how gameworlds differ from other fictional worlds.
SPATIAL INTEGRATION OF GAME SOUND

If we want to find an alternative model that describes the relative integration of sounds in gameworlds, we need to get away from the biased meaning of diegesis and instead focus on the specificities of game sound. In evaluating the usefulness of the concepts diegetic and non-diegetic in relation to game sound, I have stressed that these do not grasp how sounds are integrated into the gameworld and that they do not emphasize how sounds work as an interface providing action-relevant information to the player. In this section, I will present a game-specific approach to describing game sound that avoids the use of the diegetic/non-diegetic dyad. Due to the scope of
this chapter, the model focuses on spatial integration and the difference between gameworlds and storyworlds, but it also reflects awareness of the functional aspects of game sound by looking at it as an interface, and how these aspects transcend the border of the gameworld in a meaningful way. This model puts emphasis on how well a sound is integrated into the gameworld. It builds on and supports existing theories on how we may understand gameworlds, game sound and how they work together. Grimshaw’s radical interpretation of diegesis is conserved in emphasizing the distinction between gameworld and gamespace, and we also gain new insight into the functional and integrational aspects of so-called transdiegetic sounds. Also, Galloway’s focus on games as activities is preserved as there is a heavy focus on how sounds affect gameplay in addition to the fact that gameworlds are games intended for play. Last but not least, the model avoids all confusion connected to the usage of terminology connected to the diegesis. This approach will be described in detail below. In pointing out that game sounds should be seen as an interface, the model places emphasis on the usability aspects of sound in the sense that it provides information to the player such as warnings and responses as well as information relevant to game control, identification, and orientation. See Table 1.

This interpretation of sound’s integration into the gameworld is based on Saunders & Novak’s separation of static and dynamic interfaces, but I believe it is more fruitful and more correct to see this separation not as a binary divide but as a continuum that integrates user interface elements into the gameworld to a lesser or greater degree. Moreover, since sound is part of a game’s user interface, it is also possible to locate different sounds on the same continuum. In Table 1, I have identified five points on this continuum where sound signals tend to be located in modern computer games. All categories have a certain degree of integration into the gameworld, with the exception of the first group which is the only one that is not part of the gameworld. I call this group metaphorical interface sounds since they are not “naturally” produced by the game universe but have a more external relationship to the gameworld, even though they also have a metaphorical similarity (Keller & Stevens, 2004) to the atmosphere and the events in it. The enemy music found in Dragon Age: Origins and The Elder Scrolls III: Morrowind is a typical example of this kind of sound, which is usually system-generated and may provide orientating and identifying information as well as working proactively as a warning to the player. The remaining four categories are all integrated into the gameworld in different ways and to different degrees. Overlay interface sounds have the same relationship to the game as Saunders & Novak’s static user interface when it is added as an overlay. These sounds are directly connected to the overlay menus, maps and action bars, and are typically generated by the player in response to his commands. These are found in most game genres but are particularly common in interface-heavy genres like real-time strategy games. The example in Table 1 is from Command & Conquer 3: Tiberium Wars (EA LA, 2007), where the player
Table 1. Game sound and world integration

Interface category        Example
Metaphorical interface    Dragon Age: Origins: Enemy music
Overlay interface         C&C3: Mouseclick when selecting actions
Integrated interface      Diablo 2: Sound following boost
Emphasized interface      WoW: “Hi, you’re a tall one!”
Iconic interface          Crysis: Avatar moans when injured
typically hears the generic sound of a mouseclick every time he selects an action from any of the menus. Integrated interface sounds are typically related to user interface elements that have been placed into the gameworld, such as exclamation marks and the icons above the heads of characters. The sound played as the avatar gets a boost to stamina in Diablo 2 is a typical example of this: it is a system-generated sound that works as a notifier and also identifies the boost in question. Emphasized interface sounds have a somewhat different relationship to the gameworld as they often appear to be generated by friendly NPCs in the gameworld. An example is the lines spoken by NPCs in World of Warcraft in response to player targeting, as when the goblin merchant says “Hi, you’re a tall one!” This is a sound that appears to be diegetic in the traditional sense of the term since it is something a character in the gameworld actually says, but it is in fact a system-generated sound that has been stylized and fitted into the gameworld. Iconic interface sounds, however, are completely integrated into the gameworld and correspond to Saunders & Novak’s dynamic user interface features. In terms of film theory, these sounds would be labeled diegetic as they seem to belong naturally to the universe in which they occur. They can have any kind of generator and may provide any kind of usability information. An example of an iconic interface sound is heard when the avatar moans because he is injured in Crysis. While this model is limited to the spatial integration of game sound, it is fully compatible with my earlier models describing the usability value of a sound (Jørgensen, 2007b) and what generates a sound (Jørgensen, 2008). 
When these models are combined, we may study game sound along several dimensions that grasp usability on a more general level by identifying whether a sound provides responsive or urgent information and whether it is related to control functions, orientation or identification. Such a combination would be able to look into the gameworld, describing which event generates a sound and identifying what that event means for the player’s state. Last but not least, it would take into account how the sound is integrated into the gameworld. Combined, the models will form a comprehensive and detailed analytical tool that describes all gameplay-related sounds in computer games, without creating a confusing association with traditional diegeses.
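As an illustration of how such a combined description might be recorded, the following Python sketch stores the five categories of Table 1 as an ordered scale and attaches a generator and a usability function to each example. The names `Integration`, `GameSound`, `generator` and `function` are assumptions introduced for this sketch and are not terms fixed by the model; where the text does not state a generator or function for an example, the field is left empty rather than guessed.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Integration(IntEnum):
    """Five points on the continuum, ordered from least to most integrated."""
    METAPHORICAL = 0  # external to the gameworld, e.g. enemy music
    OVERLAY = 1       # tied to overlay menus, maps and action bars
    INTEGRATED = 2    # tied to interface elements placed in the gameworld
    EMPHASIZED = 3    # stylized system sounds that appear to come from NPCs
    ICONIC = 4        # completely integrated into the gameworld


@dataclass(frozen=True)
class GameSound:
    game: str
    description: str
    integration: Integration
    generator: Optional[str] = None  # hypothetical field: "system" or "player"
    function: Optional[str] = None   # hypothetical field: e.g. "warning"

    def in_gameworld(self) -> bool:
        # Following the text: only the metaphorical group lies outside
        # the gameworld; the other four are integrated to some degree.
        return self.integration is not Integration.METAPHORICAL


# The examples of Table 1. Generator and function are filled in only
# where the surrounding text states them.
TABLE_1 = [
    GameSound("Dragon Age: Origins", "enemy music",
              Integration.METAPHORICAL, "system", "warning"),
    GameSound("Command & Conquer 3", "mouseclick when selecting actions",
              Integration.OVERLAY, "player", "response"),
    GameSound("Diablo 2", "sound following boost",
              Integration.INTEGRATED, "system", "notification"),
    GameSound("World of Warcraft", "NPC greeting on targeting",
              Integration.EMPHASIZED, "system", "response"),
    GameSound("Crysis", "avatar moans when injured",
              Integration.ICONIC),
]

for sound in sorted(TABLE_1, key=lambda s: s.integration):
    print(sound.integration.name, sound.game, sound.in_gameworld())
```

Using an ordered enumeration rather than a flag reflects the argument that integration is a continuum and not a binary divide; a finer-grained analysis could replace the five fixed points with a numeric scale.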
CONCLUSION

When sounds work functionally in the sense of providing gameplay-relevant information to the player, they must be seen as part of the user interface of a game. In this respect, we need to acknowledge their status as such and use an approach that allows us to describe them in terms of an interface. However, the traditional distinction between diegetic and non-diegetic is not based on participatory use and does not allow us to describe game sound in this way. This chapter presents a game-oriented alternative to diegetic and non-diegetic that takes into account spatial integration of sounds from a gameplay perspective. The model is also compatible with earlier models characterizing game sound (Jørgensen, 2007a, 2008, 2009) and together they form a framework that allows us to describe the interface aspects of computer game sounds while also paying equal attention to their relationship to the gameworld as an environment that resembles those of fiction but is instead built on game rules. While this chapter argues for substitution of the terms diegesis, diegetic and non-diegetic when discussing sound in games, it should be stressed that these terms may be fruitful in some respects. They may be used when a scholar wants to compare computer games and game sound with other media and they may also be used the way this chapter does: to show why they are problematic. From these perspectives, Galloway’s, Ekman’s, Grimshaw’s and Jørgensen’s earlier work on the subject are important contributions that are especially fruitful for those seeking to
understand how game sound and gameworlds differ from other media. It is, however, important to emphasize the fact that spatiality in computer games operates on very different premises than in film, for instance, and that we talk about a different relationship between sound and environment compared to the traditional separation between diegetic and non-diegetic. A crucial difference is that gameworlds are different constructs from traditional fictional worlds and this must be taken into consideration when discussing the origin of sounds and other features. It is important to note that the model presented here is not limited to the study of game sound but that it may be used to analyze all interface-related features of a computer game. However, sound is particularly interesting because of its seamless integration and its ability to remain non-intrusive even when it tends to break with the conventions of the gameworld. It should also be mentioned that the framework is supposed to work as a tool to help us better understand how game sound and other game features operate, and as such, it will always be subject to modification.
ACKNOWLEDGMENT

Thanks to Jesper Juul, Matthew Weise, Mark Grimshaw and the anonymous review committee for comments.

REFERENCES

Aarseth, E. (2005). Doors and perception: Fiction vs. simulation in games. In Proceedings of 6th Digital Arts and Culture Conference 2005.

Aarseth, E. (2008). A hollow world: World of Warcraft as spatial practice. In Corneliussen, H., & Rettberg, J. W. (Eds.), Digital culture, play and identity: A World of Warcraft reader. Cambridge, MA: MIT Press.

Berndt, A. (2011). Diegetic music: New interactive experiences. In Grimshaw, M. (Ed.), Game Sound Technology and Player Interaction: Concepts and Developments. Hershey, PA: IGI Global.

Bordwell, D. (1986). Narration in the fiction film. London: Routledge.

Bordwell, D., & Thompson, K. (1997). Film art. An introduction to film theory. New York: McGraw-Hill.

Branigan, E. (1992). Narrative comprehension and film. London: Routledge.

Chion, M. (1994). Audio-vision, sound on screen. New York: Columbia University Press.

Collins, K. (2007). An introduction to the participatory and non-linear aspects of video games audio. In Hawkins, S., & Richardson, J. (Eds.), Essays on sound and vision. Helsinki: Helsinki University Press.

Collins, K. (2008). Game sound: An introduction to the history, theory, and practice of video game music and sound design. Cambridge, MA: MIT Press.

Command & conquer 3: Tiberium wars. (2007). EA Games.

Crysis. (2007). EA Games, Crytek.

Diablo 2. (2000). Blizzard Entertainment.

Dragon age: Origins. (2009). EA Games, Bioware.

Droumeva, M. (2011). An acoustic communication framework for game sound: Fidelity, verisimilitude, ecology. In Grimshaw, M. (Ed.), Game Sound Technology and Player Interaction: Concepts and Developments. Hershey, PA: IGI Global.

Ekman, I. (2005). Meaningful noise: Understanding sound effects in computer games. Paper from DAC 2005. Retrieved January 12, 2009, from http://www.uta.fi/~ie60766/work/DAC2005_Ekman.pdf.

Electroplankton. (2006). Nintendo, Indies Zero.

Galloway, A. R. (2006). Gaming: Essays on algorithmic culture. Electronic mediations (Vol. 18). Minneapolis, London: University of Minnesota Press.

Genette, G. (1983). Narrative discourse: An essay in method. Ithaca, NY: Cornell University Press.

Gorbman, C. (1987). Unheard melodies? Narrative film music. Bloomington: Indiana University Press.

Grimshaw, M. (2008). The acoustic ecology of the first-person shooter. City, Country: VDM Verlag.

Grimshaw, M., & Schott, G. (2007). Situating gaming as a sonic experience: The acoustic ecology of first person shooters. In Proceedings of DiGRA 2007. Situated Play.

Half-Life 2. (2004). Sierra Entertainment, Valve Corporation.

Huiberts, S., & van Tol, R. (2008). IEZA: A framework for game audio. In Gamasutra. Retrieved January 12, 2010, from http://www.gamasutra.com/view/feature/3509/ieza_a_framework_for_game_audio.php.

Huizinga, J. (1955). Homo ludens: A study of the play element in culture. Boston: Beacon Press.

Jørgensen, K. (2006). On the functional aspects of computer game audio. In Proceedings of the Audio Mostly Conference (pp. 48-52).

Jørgensen, K. (2007a). ‘What are these grunts and growls over there?’ Computer game audio and player action. Unpublished doctoral dissertation. Copenhagen University, Denmark.

Jørgensen, K. (2007b). On transdiegetic sounds in computer games. Northern lights Vol. 5: Digital aesthetics and communication. Intellect Publications.

Jørgensen, K. (2008). Audio and gameplay: An analysis of PvP battlegrounds in World of Warcraft. In Gamestudies, 8(2). Retrieved January 12, 2010, from http://gamestudies.org/0802/articles/jorgensen.

Jørgensen, K. (2009). A comprehensive study of sound in computer games. Lewiston, NY: Edwin Mellen Press.

Juul, J. (2005). Half-real. Video games between real rules and fictional worlds. Cambridge, MA: MIT Press.

Keller, P., & Stevens, C. (2004). Meaning from environmental sounds: Types of signal-referent relations and their effect on recognizing auditory icons. Journal of Experimental Psychology. Applied, 10(1). doi:10.1037/1076-898X.10.1.3

Klevjer, R. (2007). What is the avatar? Fiction and embodiment in avatar-based singleplayer computer games. Unpublished doctoral dissertation. University of Bergen, Norway.

Nacke, L., & Grimshaw, M. (2011). Player-game interaction through affective sound. In Grimshaw, M. (Ed.), Game Sound Technology and Player Interaction: Concepts and Developments. Hershey, PA: IGI Global.

Rock band. (2007). EA Games.

Salen, K., & Zimmermann, E. (2004). Rules of play: Game design fundamentals. Cambridge, MA: MIT Press.

Saunders, K., & Novak, J. (2006). Game development essentials: Game interface design. Stamford, CT: Cengage Learning.

Stockburger, A. (2003). The game environment from an auditory perspective. In M. Copier & J. Raessens (Eds.), Proceedings of Level Up: Digital Games Research Conference.

Tarantino, Q. (1994). Pulp fiction. Miramax.

The Elder Scrolls III: Morrowind. (2002). Bethesda Softworks.

Warcraft 3: Reign of chaos. (2002). Blizzard Entertainment.

Whalen, Z. (2004). Play along: An approach to video game music. Gamestudies, 4(1). Retrieved January 12, 2010, from http://www.gamestudies.org/0401/whalen.

Wilhelmsson, U., & Wallén, J. (2011). A combined model for the structuring of game audio. In Grimshaw, M. (Ed.), Game Sound Technology and Player Interaction: Concepts and Developments. Hershey, PA: IGI Global.

World of Warcraft. (2004). Vivendi Games.
KEY TERMS AND DEFINITIONS

Diegesis: Originally referring to pure narrative, or situations in which the author is the communicating agent of a narrative, diegesis was revived in the 1950s to describe the “recounted story” of a film. It is today the accepted term in film theory to refer to the fictional world of the story.

Diegetic: That which is part of the depicted fictional world. Diegetic sounds are thus sounds that have a source in the fictional world.

Game System: The formal structure of the game consisting of a set of features that affect each other to form a pattern. Includes the rules of a game and the mechanisms that decide how the rules interact.

Gamespace: The conceptual space or arena in which a game is played, independent of any possible fictional universe in which it may be set. Gamespace is defined by the magic circle, and includes potentially all elements relevant for playing, regardless of whether they are part of the original system or not.

Gameworld: A unified and self-contained universe that is functionally and environmentally designed for the purpose of playing a specific game. Gameworlds are oriented towards a specific gameplay experience and do not need to be explained as a credible part of a hypothetical world.

Metaphorical Interface Sounds: Sounds that provide usability information to the player while being placed external to the gameworld. An example is adaptive music which informs the player that an enemy is approaching.

Non-Diegetic: That which is external to the fictional world. Non-diegetic sounds are thus sounds represented as coming from a source outside the fictional world.

Overlay Interface Sounds: Sounds that are associated with the overlay interface placed as a filter on top of the gameworld. An example is the sound of mouseclicks whenever the player makes a selection from the action bar.

Transdiegetic: Transdiegetic features are auditory and visual elements of a computer game which transcend the traditional division between diegetic and non-diegetic by way of merging system information with the gameworld. Transdiegetic features thus create a frame of communication that has usability value at the same time as they are integrated into the represented universe of the game.

Integrated Interface Sounds: Sounds that are connected to user interface elements that have been placed inside the gameworld for usability purposes. An example is system-generated sounds that follow the player’s collecting of coins, boosts or other prizes.

Emphasized Interface Sounds: Sounds that have been stylized and fitted into the gameworld while also remaining clear system-generated features. Examples are the auditory responses from units being selected in strategy games.

Iconic Interface Sounds: System-generated sounds that are completely integrated into the gameworld as if they were natural to that universe. An example is the sound of weapon use in a game.
ENDNOTES

1. All quotes are originally in Norwegian, and have been translated by the author.

2. MMO is short for Massively Multiplayer Online games. These are games in which thousands of players play together on online servers.

3. Originally a military technology, HUD is short for heads-up display which is “an electronic display of instrument data projected at eye level so that a driver or pilot sees it without looking away from the road or course” (Random House Dictionary, 2009).

4. As the formal structure of the game, the game system seems to lie somewhere in between the gamespace and the gameworld. While talk between players during group play in the same physical space would be part of the gamespace, this kind of communication is not an actual part of the formal game system. However, so-called external transdiegetic features, such as music signalling incoming enemies, are clearly part of the game system even though they are not part of the gameworld.
Chapter 6
A Combined Model for the Structuring of Computer Game Audio

Ulf Wilhelmsson
University of Skövde, Sweden

Jacob Wallén
Freelance Game Audio Designer, Sweden
ABSTRACT

This chapter presents a model for the structuring of computer game audio building on the IEZA-framework (Huiberts & van Tol, 2008), Murch’s (1998) conceptual model for the production of film sound, and the affordance theory put forth by Gibson (1977/1986). This model makes it possible to plan the audio layering of computer games in terms of the relationship between encoded and embodied sounds, cognitive load, the functionality of the sounds in computer games, the relative loudness between sounds, and the dominant frequency range of all the different sounds. The chapter uses the combined model to provide exemplifying analyses of three computer games: F.E.A.R., Warcraft III, and Legend of Zelda. Furthermore, the chapter shows how a sound designer can use the suggested model as a production toolset to structure computer game audio from a game design document.
DOI: 10.4018/978-1-61692-828-5.ch006

Copyright © 2011, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

INTRODUCTION

Computer game audio is an often neglected area when analyzing and producing computer games (Cancellaro, 2006; Childs, 2007; Marks, 2001). The same seems to be the case when analyzing or producing movies (Murch, 1998; Thom, 1999). There is a general lack of functional models for the analysis as well as the production of computer game audio, even though some good examples of functional models, such as Sander Huiberts and Richard van Tol’s (2008) IEZA-framework (Figure 1), are available. The IEZA-framework is also discussed in Droumeva (2011).

Figure 1. Huiberts and van Tol’s IEZA-framework for the analysis and production of computer game audio into which we have added frames for the different categories. Adapted from Huiberts and van Tol (2008).

In this chapter, we use the IEZA-framework in combination with Walter Murch’s (1998) conceptual model for film sound (Figure 2). Why combine these two different areas, that is, a model concerned with computer game audio and another with film sound? As Huiberts and van Tol (2008) have noted, film sound is a “field of knowledge that is closely related to game audio” (p. 2). Although these two areas are related and do share some common ground, they are also quite different in many ways. It is striking that when we think about games we use the term audio, yet when we think about film we seem to primarily use the term sound. In our opinion, there is a difference between these two: audio is a more technology-based term than sound. A sound is something you hear which in turn leads to listening, while audio is something that precedes sound but with stronger technological connotations as a term. Film sound, as Murch notes, normally consists of composites of sound in several layers, an assertion which precedes a more thorough discussion of this model (Figure 2). Murch concludes that we may be wise to limit those layers to a total of five different ones simultaneously played back on the sound track of a movie. A common method of separating the different parts of film sound is a typology consisting of three separate categories: speech, effects, and music (Bordwell & Thompson, 2001; Sobchack & Sobchack, 1980). This typology is originally based on the technology of early sound films and its three tracks, constituting a practice-oriented separation of sound into different categories. It also corresponds well to Murch’s
(1998) conceptual color model (Figure 2), which spans from language that clearly has relations to speech (encoded) via effects to music (embodied). With such a typology rather clearly differentiating the three basic entities of film sound from each other, we might jump to the conclusion that film sound is fairly easy to create and that computer game audio could be modeled, more or less, on the practice and theories of film sound. Since we only have three basic categories of sounds that can be used and combined to create a sonic environment, how hard can it really be? However, film sound is more complex than this initial typology suggests and we address this in this chapter. Furthermore, computer game audio works under quite different conditions than film sound does: film sound is fixed, stored and played linearly. This does not, however, mean that sound in movies needs to be synchronous with the visual, since it might be narrating at a different level that does not have its basis in the present image (Hug, 2011; Kubelka, 1998; Pudovkin, 1929/1985). Computer game audio, on the other hand, is dynamic and stored as a resource for the player to use in a non-linear fashion. An invariant set-up of sounds is stored in a database, but the use of objects that would
Figure 2. Murch’s conceptual model. Adapted from Murch (1998)
produce these sounds while the game is being played is likely to be highly variable if the game is not to become extremely linear in its progression and very boring to play (see Farnell, 2011; Mullan, 2011 for technologies offering the chance to break from this paradigm). The typology for film sound and its three categories can also be compared to how the human auditory system is biased. Humans are biased towards listening for voices (Chion, 1994, p. 6), and towards attempting to interpret voices as words of language, and spoken language is a primary resource for communication. Humans are generally most sensitive in the part of the sound spectrum occupied by the human voice, that is, approximately 150 to 6000 Hz, and especially sensitive within the range of 3000 to 4000 Hz. Spoken language occupies a quite broad part of the sound spectrum in which the threshold is low. In movies and games this part of the sound spectrum is also commonly inhabited by concurrent sounds, such as explosions, music, and so on, which have a natural broad spectrum.¹ A voice does not need the same level as a low pitched boom in order to be perceived as having the same loudness. There are a number of reasons as to why some sounds fuse together into one and why, in some cases, they do not. These include frequency, relative amplitude, timbre, onset, amplitude envelope, and sound source location: sounds that fuse tend to have one or several of these factors within the same, small range. Additionally, one should not forget the active ability of humans to focus on particular sounds to the exclusion of others: what is commonly referred to as the cocktail party effect. In this chapter we cannot discuss the whole field of acoustics and psychoacoustics but will need to focus the attention towards a limited number of issues concerning the complexity of sound such as dynamics, relative amplitude, dominant frequencies, and their relation to semantic value.² For the time being, we can conclude that if there are many sounds with the same properties, clarity might become a problem and, at worst, the mix will become blurred or distorted. Therefore, it can be useful to consider what types of sounds have already been used when designing a sonic environment, for which our model can be a powerful toolset. Will broad dynamics thus create good sound design by itself? The answer is obviously no. If two or more sounds are played simultaneously, they may blend and be heard as one. In theory, the more the sounds differ in a dominant frequency span and relative loudness, the easier it becomes to distinguish them into two different sounds with different semantic values. Reality, on the other hand, is not that simple. In games, two audio files can typically be played together in innumerable ways and with different timings. Consequently,
A Combined Model for the Structuring of Computer Game Audio
one of the key problems of computer game audio is the sound designer’s loss of control over the playback of sound during the gameplay of a complex game. Two or considerably more different (or identical) audio files might be played simultaneously due to gameplay events induced by the player or the game system. This could lead to a chaotic sonic environment, the “logjam” of sound that Murch (1998) describes in relation to film sound (see also Cancellaro, 2006; Childs, 2007; Marks, 2001; Prince, 1996; Wallén, 2008). This “logjam” does not support or enhance gameplay and may also become very tiresome to listen to during even short sessions of play. To prevent sounds from losing their definition and thereby their semantic value, a game audio designer needs to plan and structure the game audio as much as possible, which constitutes a major problem. The sound designer can design and deliver the sounds to a game but the player is the one person in control of the play button. The goal for a sound designer should be to retain as much control over the final sonic environment as possible, even though it is hard to define exactly when the sounds are going to be played. Since game sounds are usually loaded on call by certain events in the game, the sounds cannot be edited and mixed in a fashion similar to the mixing of film sound. In other words, the sonic environment has to be spread out beforehand. To avoid a big, undefined wall of sound, the sounds have to be somewhat compatible with each other. One could compare sound layers with a jigsaw puzzle; in order to complete it, each part must fit in with the surrounding parts. If a number of pieces are put on top of each other, the parts at the bottom will be covered and not clearly visible. On the other hand, as Chion (1994) has noted, sounds may be superimposed on top of each other without the conceptualization that they stem from different environments (pp. 45-46).
The problem is that sounds which are similar will blur into one another. By using the entire dynamic and frequency range, as well as the panorama and distribution of the cognitive
load over the brain, the sonic environment is more likely to be clear and distinct. Perhaps every sound does not have to be as loud as possible (Thom, 1999)? If every sound is evaluated and then given values for a set of variables, such as dynamic range, dominant frequency, and cognitive load, the sonic environment can be easier to visualize. This is what our combined model does. So far we have identified a number of key problems in the analysis and production of computer game audio:
• There is a general lack of functional models for the analysis of computer game audio
• There is also a general lack of functional models for the production of game audio
• The loss of control that a sound designer has over the playback of the audio in the gameplay of a complex game may lead to a chaotic blur of sounds which makes them lose their definition and hence their semantic value
• When two or more sounds play simultaneously, the clarity of the mix depends on the type of sounds, which leads to
• The nature of the relationship between encoded and embodied sounds.
Furthermore:
• Sound is often an abstract to game designers, graphical artists and programmers, due to a lack of consistent and communicable terminology.
The overall purpose of this chapter is therefore to present a model (Figures 3 to 8) that solves these problems and makes it possible to plan the audio layering of computer games in terms of:
• The relationship between encoded and embodied sounds
• Cognitive load
• The functionality of the sounds in computer games
• The relative loudness between sounds and
• The dominant frequency range of all the different sounds.
Figure 3. The structure of the combined model of computer game audio (without the addition of specific sounds)
We also try to establish a consistent communicable terminology for the analysis and production of computer game audio. The suggested model (Figure 3) combines Huiberts and van Tol’s (2008) IEZA-framework (Figure 1) with Murch’s (1998) conceptual model (Figure 2). This combined model may be used to plan the sonic environment of computer games in a complete and balanced way, that is, balanced in relation to the sound spectrum available and complete in relation to the visual component of the game. The model constitutes a tool that provides a sonic richness and avoids “the logjam of sounds” (Murch, 1998). It may also be used to
analyze computer game audio, and a number of analyses showing the benefits of this are provided. Through the use of this model, a sound designer will be able to clearly understand how different kinds of games emphasize different parts of the audio due to the genre and gameplay principles. This chapter is structured as follows: We first present the two models for game audio that we have tried to combine, starting with the IEZA-framework for computer game audio and proceeding to Murch’s conceptual model for film sound. This is followed by a presentation of the combined model, how it is structured and what kind of problems it can solve. In order to provide a more theoretical approach to the complexity of computer game audio, a discussion concerned with playing computer games and listening to the sounds, which is sustained by a case study, follows. We then provide 3 exemplifying analyses of existing computer games, F.E.A.R. (Monolith Productions, 2005), Warcraft III (Blizzard Entertainment, 2002) and Legend of Zelda (Nintendo, 1987), to show how the combined model, as an analytical toolset, may be applied. Since this model is also suitable for the production of game audio, we then provide an example of how to actually use it as a production toolset. The final section is a summary of this chapter and our concluding thoughts.
The IEZA-Framework
Although we show that sound is closely related to immersion, most literature on game audio does not deal with fundamental questions, such as those related to what game audio really is, what it consists of and what makes it function in games. It is striking that in this emerging field, theory on game audio is still rather scarce. While most literature focuses on the production and implementation of game audio, like recording techniques and programming of sound engines, surprisingly little has been written in the field of ludology about the structure and composition of game audio. (Huiberts and van Tol, 2008, p. 1)
Huiberts and van Tol were looking for a functional and coherent framework to use for the study of game audio and examined different categorization methods, from games and films respectively. However, they found that none provided any sensible information about the organization and functionality of the audio. This is a problem since the functionality of sound is essential to computer games. While this model, in its original form, does not specifically discuss the semantics of sounds in a detailed way, our combined model emphasizes this important issue.3 Huiberts and van Tol propose that a more coherent way to categorize the audio in a game should also include the function, role and properties of the different sounds. They therefore developed the IEZA-framework (Figure 1) for the categorization and planning of audio in computer games.
The IEZA-framework consists of 4 categories:
1. Interface: Sounds related to the game’s interface. Interface sounds are non-diegetic and belong to the game as a system.
2. Effect: Sounds directly or indirectly triggered by the player’s actions. The sounds of effects are diegetic and the result of activity within the game environment.
3. Zone: Sounds related to the game environment. Zone sounds are diegetic and belong to the setting.
4. Affect: Sounds outside the game environment, mainly intended to set the mood. The sounds of affects are non-diegetic and often used to create anticipation.
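As an illustration only, the four categories can be written down as a small lookup structure. The sketch below is in Python and is not part of the IEZA-framework itself: the names IEZA and classify are hypothetical, and the encoding of each category as a pair (diegetic or not, activity or setting) is an assumption about how the framework's two axes might be represented in code.

```python
from enum import Enum


class IEZA(Enum):
    """The four IEZA categories, stored as (diegetic, axis) pairs.

    The pair encoding is an illustrative assumption, not the authors' notation.
    """
    INTERFACE = (False, "activity")  # non-diegetic, tied to player activity
    EFFECT = (True, "activity")      # diegetic, tied to player activity
    ZONE = (True, "setting")         # diegetic, part of the setting
    AFFECT = (False, "setting")      # non-diegetic, part of the setting

    @property
    def diegetic(self) -> bool:
        return self.value[0]

    @property
    def axis(self) -> str:
        return self.value[1]


def classify(diegetic: bool, axis: str) -> IEZA:
    """Return the category that sits at the given position on both axes."""
    for category in IEZA:
        if category.diegetic == diegetic and category.axis == axis:
            return category
    raise ValueError(f"axis must be 'activity' or 'setting', got {axis!r}")


print(classify(diegetic=True, axis="setting").name)
```

Here a diegetic sound that belongs to the setting resolves to the Zone category, and the other three combinations resolve in the same way.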
The 4 categories are arranged along 2 axes in a cross pattern: diegetic versus non-diegetic on the vertical axis and activity versus setting on the horizontal axis. The terms diegetic and non-diegetic are also very often used in film theory (Bordwell & Thompson, 1994; Bordwell & Thompson, 2001; Chion, 1994; Wilhelmsson, 2001) and differentiate the environment inside the movie/game, that is, the diegesis, from the system that carries this world inside the movie/game, that is, the non-diegetic (Cunningham, Grout, & Picking, 2011; Jørgensen, 2011). The IEZA-framework makes a clear distinction between the sounds that belong inside the game environment, the Zone and the Effect sounds, and the sounds that belong to the system as such, the Affect and Interface sounds. The horizontal axis places the sounds on a scale of setting versus activity. The Zone and Affect sounds belong to the setting of the game and the Effect and Interface sounds to the activities during gameplay. This is a good starting point for understanding how computer game audio may be categorized in accordance with its functionality within the sonic environment of a specific game. We agree with Huiberts and van Tol (2008) that this structure enables the IEZA-framework to go deeper than other similar frameworks. We have used the IEZA-framework successfully in game audio courses at the University of Skövde with good and promising results. Nevertheless, we have also realized that the model does not cover all the important issues with regard to creating a sonic richness and at the same time avoiding the smearing of sound all over the sonic environment. The key problem is that the IEZA-framework does not, in itself, produce a visualization of the cognitive load, the relation between the semantic value of different sounds, the relation between encoded and embodied sounds, the dominant frequencies of a sound file or its loudness. Combined with Murch’s (1998) conceptual model, the IEZA-framework can be part of a more elaborate tool for the production and the analysis of computer games. We have now covered the first node of our combined model and it is time to take a closer look at the second node: Murch’s conceptual model of film sound.
Murch’s Conceptual Model
One central point made by Murch in his work on the conceptual model (1998) is that just as audible sound may be placed on a scale ranging from approximately 20 Hz to 20,000 Hz, a sound may also be placed on a conceptual scale from Encoded to Embodied covering a spectrum from speech to music via sound effects in order to avoid a “logjam” of sounds. This dimension of film sound is the reason for our choice of Murch’s conceptual model as the second node of our combined model of computer game audio. The IEZA-framework does not, in itself, categorize the different sounds on a scale from encoded to embodied, and no references to Murch’s conceptual model of film sound are made in Huiberts’ and van Tol’s article (2008).
Example from Murch (1998)
1. Violet – Dialogue
2. Cyan/Green – Linguistic/Rhythmic Effects (e.g. footsteps, door knocks etc)
3. Yellow – Equally Balanced Effects
4. Orange – Musical Effects (e.g. atmospheric tonalities)
5. Red – Music.
Before addressing Murch’s conceptual model we need to elaborate on the statement that humans are biased towards listening for voices (Chion, 1994, p. 6). Chion states that: “Sound in film is, above all, voco- and verbocentric because human beings in their habitual behavior are as well” (p. 6). He suggests 3 different listening modes: causal, semantic and reduced listening. We first listen in order to identify the cause of a sound (causal listening) and, once it is identified, we listen to find the meaning of the sound (semantic listening). Reduced listening is a special case that is not discussed in this chapter. What, then, does Chion suggest about how listening to a cinematic soundtrack works with regard to the 3 different types of sound, that is, speech, effects, and music?
If the scene has dialogue, our hearing analyzes the vocal flow into sentences, words – hence, linguistic units. Our perceptual breakdown of noises will proceed by distinguishing sound events, the more easily if there are isolated sounds. For a piece of music we identify the melodies, themes, and rhythmic patterns, to the extent that our musical training permits. In other words, we hear as usual, in units not specific to cinema that depend entirely on the type of sound and the chosen level of listening (semantic, causal, reduced). The same thing obtains if we are obliged to separate out sounds in the superimposition and not in their succession. In order to do so we draw on a multitude of indices and levels of listening: differentiating masses and acoustic qualities, doing causal listening, and so on. (Chion, 1994, p. 45)
A Combined Model for the Structuring of Computer Game Audio
What does this then mean? Voco- and verbocentrism relates to 2 of the 3 listening modes suggested by Chion (1994), that is, causal listening and semantic listening. In Chion’s work, the part on semantic listening is very brief and only discusses semantic value in relation to a code or spoken language. However, it is fruitful to also use this concept in relation to the system of sounds that a given movie or a given game puts forth, which may be understood as a semiotic system consisting of sound signs. Within such a system the sounds are part of the communication of the environment. This line of thought is also found in Murch’s conceptual model. According to Murch (1998), the clearest example of encoded sound is speech and the clearest example of embodied sound is music. Furthermore, since the human brain normally divides the processing of sound (and other stimuli) between the left and right side of the brain, we are able to discern 5 different layers of sound simultaneously, if they are evenly spread on the conceptual spectrum from encoded/violet to embodied/red. Murch (1998) provides a number of practice-based examples and problems from his work on the film Apocalypse Now (Coppola, 1979):
[…] it appeared to be caused by having six layers of sound, and six layers is essentially the same as sixteen, or sixty: I had passed a threshold beyond which the sounds congeal into a new singularity – dense noise in which a fragment or two can perhaps be distinguished, but not the developmental lines of the layers themselves. With six layers, I had achieved Density, but at the expense of Clarity. What I did as a result was to restrict the layers for that section of film to a maximum of five. By luck or by design, I could do this because my sounds were spread evenly across the conceptual spectrum.
Murch’s problem in this case concerned the 6 concurrent layers described below:
1. Dialogue (violet)
2. Small arms fire (blue-green ‘words’ which say “Shot! Shot! Shot!”)
3. Explosions (yellow “kettle drums” with content)
4. Footsteps and miscellaneous (blue to orange)
5. Helicopters (orange music-like drones)
6. Valkyries Music (red).
Something had to be sacrificed whilst maintaining density and clarity and Murch therefore decided to omit the music and have a five layer soundtrack consisting of:
1. Dialogue (“I’m not going! I’m not going!”)
2. Other voices, shouts, etc.
3. Helicopters
4. AK-47s and M-16s
5. Mortar fire.
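The decision described above, keeping at most five concurrent layers and sacrificing the least essential one, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than Murch's own procedure: the function restrict_layers and the numeric priorities are hypothetical, and the priorities are chosen only so that the music ranks lowest, mirroring the choice he reports.

```python
MAX_LAYERS = 5  # the ceiling Murch settled on for that section of the film

# (name, priority) pairs. The priorities are illustrative assumptions,
# chosen only so that the music ranks lowest, as in Murch's own decision.
six_layers = [
    ("dialogue", 6),
    ("small arms fire", 5),
    ("explosions", 4),
    ("footsteps and miscellaneous", 3),
    ("helicopters", 2),
    ("Valkyries music", 1),
]


def restrict_layers(layers, max_layers=MAX_LAYERS):
    """Keep at most max_layers layers, dropping the lowest-priority ones first."""
    ranked = sorted(layers, key=lambda layer: layer[1], reverse=True)
    kept = {name for name, _ in ranked[:max_layers]}
    # Preserve the original ordering of the layers that survive.
    return [name for name, _ in layers if name in kept]


print(restrict_layers(six_layers))
```

In a game, where the number of concurrent sounds is not known in advance, a comparable rule would have to run at playback time rather than at the mixing desk.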
In Murch’s (1998) example, the instances of “small arms fire” are effect sounds with a semantic value that Murch calls “blue-green ‘words’ which say Shot! Shot! Shot!”. Firing a gun in a game would typically result in a direct response soundwise. The player would also probably anticipate such a feedback. In addition, firing a gun would conceptually evoke a sound and some kind of visual response as well. Sound reveals something about the environment. In this case, it signals the presence of guns and a potential danger; it is a sign that denotes clear and present danger. Humans tend to seek meaning and structure in and from the surrounding environment. In this case, we try to identify the source of a sound and what it might mean in the present context. We have used both causal and semantic listening and found a specific sound among others in the concurrent audio layers. The sound as such is dense and clear at the same time since it is carefully planned to occupy a specific frequency range and to develop within a specific part of the dynamic range. Is Murch on the right track with his conceptual model and his conclusion that five concurrent
layers of sound is the maximum for obtaining density and clarity? Murch’s conceptual model corresponds well to Miller’s (1956) The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information. As humans, we have limitations with regard to processing data. Miller (1956) discussed this in terms of bits and chunks:
If the human observer is a reasonable kind of communication system, then when we increase the amount of input information the transmitted information will increase at first and will eventually level off at some asymptotic value. This asymptotic value we take to be the channel capacity of the observer: it represents the greatest amount of information that he can give us about the stimulus on the basis of an absolute judgment. The channel capacity is the upper limit on the extent to which the observer can match his responses to the stimuli we give him.
He used Pollack’s (1952, 1953) work on auditory displays to discuss and explain absolute judgment of unidimensional stimuli which clearly showed the channel capacity for pitch to be “2.5 bits which corresponds to about six equally likely alternatives” (Miller, 1956). It is interesting to note that sound was the focus for this groundbreaking work on the human capacity for processing information. The combination of moving images and sonic environment makes up the setting in which the actions of a movie or a game take place. Watching a movie without any sound added is often somewhat dull. Our experience is that an audience trying to become immersed in old silent movies without any preserved soundtrack grows bored and separated from the events on the screen. This is in accordance with Walter Ong’s (1982/90) remarks on the bipolarity of sight and hearing which we later elaborate in the discussion about playing computer games and listening to the sounds. Adding a musical soundtrack heightens
the immersion radically, and even an old film such as The Phantom Chariot (Sjöström, 1921), with a musical soundtrack composed by Matti Bye in 1998, becomes an interesting movie. This clearly exemplifies that sound, even if it is only music, has the effect of including the audience in the environment and that the moving images do not, in themselves, have the same desired effect, which supports Ong’s claim on the bipolarity of vision and hearing. Only music? Of course, music should not be considered solely as embodied rather than encoded. Music is a plethora of systems. It can be narrative and it contains many culturally dependent codes. However, Murch’s point is that music works rapidly and is usually aimed more at our emotive than our intellectual response. There are, of course, differences between individuals. “For a piece of music we identify the melodies, themes, and rhythmic patterns, to the extent that our musical training permits” (Chion, 1994, p. 45). In the case of Bye’s musical score for The Phantom Chariot, much of the music mimics other kinds of sounds, such as the squeaks of the chariot’s wheels, and therefore the music is also clearly semantic. Bye’s music can very well be said to make use of a scale of sounds from encoded to embodied that comprises the soundtrack. Thus far we can conclude that Murch’s model fits well into the idea of an upper limit on simultaneous processing capacity that has been thoroughly investigated since at least 1956. Furthermore, the conceptual model he suggests makes it possible to consider sound on a scale spanning from encoded to embodied. This in turn implies that if such a scale, spanning from encoded to embodied sound, were to be used in combination with the IEZA-framework, it would be much easier to structure a sonic environment for a computer game. We also have a new set of parameters that add content to the sound categories suggested by the IEZA-framework.
This content is the level of meaning a specific sound carries. Meaning, or semantic value, is not only carried by sonic signs such as the spoken words, utterances and so on of language production, although language and its use is a prototypical example of highly encoded sounds which Murch’s model emphasizes.
The Combined Model for Game Audio
This section introduces the different parts of the combined model for the layering of computer game audio. The combined model makes it possible to categorize the different sounds for any part of a game in a number of ways. Such a categorization could span from relative dynamic range, dominant frequency areas, and “encoded sound” versus “embodied sound” (Murch, 1998) to whether a sound belongs to the diegesis of the game, if it is part of the interface, belongs to the activity of playing, or to the setting in general. If, for example, many “encoded sounds” are used, such as spoken language in a game, it is necessary to be attentive to the total sonic environment in which these “encoded sounds” take place and plan for an acoustic niche for the dialogue with few interfering sounds played simultaneously within the same frequency span. If many “embodied sounds” are needed, such as music combined with ambient sounds designating the environment, it will be necessary to make them work together by shaping the sounds to fit and allow each other concurrent presence. As Figures 1 to 3 above show, we have taken the basic differentiation of the game audio divided into Interface, Effect, Zone and Affect sounds from the IEZA-framework. We have also used, from the original IEZA-framework, the horizontal axis that differentiates sounds on the setting versus activity scale and the vertical axis that describes sound as diegetic or non-diegetic. The IEZA-framework is intact within our model; Murch’s conceptual model has, however, been visually adapted. The centre of the circle equates to the left-hand foot of Murch’s arch (violet/encoded) and, moving
away from the centre, Murch’s spectrum is then traversed until the periphery of the circle which equates to red/embodied. The more central a sound is placed, the higher its level of encoding; the more peripheral a sound is, the lower its level of encoding. This is a clear difference from both source models: The IEZA-framework does not itself allow such a visual differentiation, and Murch’s conceptual model does not, apart from Effects, place the sounds in specific categories (such as Interface, Zone and Affect) nor place the sound on the vertical axis of diegetic versus non-diegetic or the horizontal axis of setting versus activity. Murch does write about the setting versus activity scale, but his conceptual model does not have a structure that visualizes this aspect clearly. The effect of combining these two models, that is, the IEZA-framework and the conceptual model, will make it easier to understand what is happening in the sound environment beforehand, in a more detailed manner. The sound designer does not need to use actual sounds: They may be derived from a game script or story board prior to production.4 If the sound designer has lots of sounds in the centre of the model, she is most likely to produce a cognitive overload for the player because the sounds in the centre are encoded and need more intellectual processing to be meaningful and distinct. Since controlling dominant frequencies is one way to distinguish one sound stream from another, we have chosen to make this quality of sound visible within the model. We have also chosen to use 3 basic primitives, as Figures 4 and 5 illustrate:
• A circle = a sound in which the bass frequencies are dominant
• A square = a sound in which the midrange frequencies are dominant
• A triangle = a sound in which the treble/high frequencies are dominant.
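The mapping from dominant frequency band to primitive can be sketched as a short Python function. The chapter does not state where bass ends or treble begins, so the two band edges below (250 Hz and 4000 Hz) and the name primitive_for are assumptions made purely for illustration; a designer would substitute limits suited to the game at hand.

```python
BASS_MAX_HZ = 250.0     # assumed upper edge of the bass band (not from the chapter)
TREBLE_MIN_HZ = 4000.0  # assumed lower edge of the treble band (not from the chapter)


def primitive_for(dominant_hz: float) -> str:
    """Return the shape used for a sound with the given dominant frequency."""
    if dominant_hz <= 0:
        raise ValueError("dominant frequency must be positive")
    if dominant_hz < BASS_MAX_HZ:
        return "circle"    # bass frequencies dominate
    if dominant_hz < TREBLE_MIN_HZ:
        return "square"    # midrange frequencies dominate
    return "triangle"      # treble/high frequencies dominate


for hz in (60.0, 1000.0, 8000.0):
    print(hz, primitive_for(hz))
```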
These 3 basic primitives were chosen since they seemed natural, but this is not to say that they would be the natural shapes to represent these qualities: They are more likely the result of a process of cultural connotation than anything else. A bass sound does not have the same sharpness as a midrange sound. It is often hard to hear from what place a bass sound originates, which is why a circle was chosen. However, a midrange sound has a high degree of definition and is distinct, which is the reason a square was chosen. Furthermore, a triangle was chosen for the treble sound, which is sharp and often pointy.5 But we also need yet another dimension and that is loudness. These primitives have therefore been made in 5 different sizes designating their relative loudness: A larger primitive represents a louder sound.
Figure 4. The different loudness primitives used in the combined model ranging from very low to very loud. Midway should be about normal speech level, assuming that the game’s sound is somewhat dynamic. The scale is, in other words, relative to the specific game’s loudest and quietest sounds
Figure 5. F.E.A.R. analysis example
The model is able to show the following aspects:
• The amount and clustering of encoded sounds
• The amount and clustering of embodied sounds
• The amount and clustering of diegetic sounds
• The amount and clustering of non-diegetic sounds
• The amount and clustering of Interface sounds
• The amount and clustering of Effect sounds
• The amount and clustering of Zone sounds
• The amount and clustering of Affect sounds
• The relative loudness between sounds
• The dominant frequencies in each sound.
The parameters above help the sound designer avoid cognitive overload due to a logjam of sounds. We have now described how the combined model is structured. Before explaining how the sound designer can use it to analyze and/or structure the sonic environments in computer games, we need to elaborate on the discussion of the process of playing computer games and listening to the sonic component of the environment as part of game playing.
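To make the structure concrete, one entry in the combined model can be sketched as a small record, together with a summary that counts sounds per category and lists those clustered near the encoded centre. This is a rough Python sketch, not the authors' tool: the names PlacedSound, size_for and summarize, the 0.0 to 1.0 radius scale, and the 0.3 threshold for "central" are all assumptions introduced for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PlacedSound:
    name: str
    category: str  # "Interface", "Effect", "Zone" or "Affect"
    radius: float  # assumed scale: 0.0 = centre (encoded), 1.0 = periphery (embodied)
    shape: str     # "circle", "square" or "triangle" (dominant frequency band)
    size: int      # 1 (very low) to 5 (very loud), relative to this game


def size_for(loudness: float, quietest: float, loudest: float) -> int:
    """Map a loudness value onto 5 sizes, relative to the game's own range."""
    if loudest <= quietest:
        raise ValueError("loudest must exceed quietest")
    fraction = (loudness - quietest) / (loudest - quietest)
    fraction = min(max(fraction, 0.0), 1.0)  # clamp values outside the range
    return 1 + round(fraction * 4)


def summarize(sounds, central_radius=0.3):
    """Count sounds per category and name those clustered near the centre."""
    per_category = Counter(sound.category for sound in sounds)
    central = [sound.name for sound in sounds if sound.radius <= central_radius]
    return per_category, central


sounds = [
    PlacedSound("radio dialogue", "Effect", 0.1, "square", size_for(-20, -40, 0)),
    PlacedSound("menu voice-over", "Interface", 0.2, "square", size_for(-25, -40, 0)),
    PlacedSound("wind ambience", "Zone", 0.8, "circle", size_for(-35, -40, 0)),
]
per_category, central = summarize(sounds)
print(dict(per_category), central)
```

Two midrange sounds close to the centre, as in this invented example, are the kind of cluster the model is meant to expose before production begins.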
Playing Computer Games and Listening to the Sounds
What is game playing? How does a player act and why does she act the way she does? What role does sound have in the playing of a computer game? Can sound be used to manipulate the player into acting in specific ways?
Playing a computer game involves the manipulation of objects within the game environment in a dynamic, sequential flow of events. During play, the player will be processing a lot of data that will need to be made meaningful in order to proceed within the game environment and the game as a system. The player will need to identify the data and turn it into information by categorizing the graphical as well as the sound elements. Objects in a game may be connected to a corresponding sound that carries meaning, that is, there is a semantic level in the sounds and the graphics that is fundamental to gameplay as such. Scripted sounds, or series of sounds, on the other hand, are the result of a player’s position within the environment rather than a specific gameplay action taken by the player. For example, a player reaches a certain point in the game environment and a sound starts. The player has not taken any conscious action and hence does not anticipate any specific feedback. The conceptual spectrum induces different kinds of anticipation. You might very well use scripted sounds to make the player take action, for example, by letting the player walk in a narrow corridor with glass walls on one side. When the player reaches a specific point of observation, you script a sound event to occur (Gibson, 1986), also adding a sudden motion seen through the glass wall. If the gameplay is based on survival/horror, you might induce the player to waste a number of bullets on the supposition that the sound and the motion imply danger is present, that is, you scare her into action through a scripted event. In a film everything is placed in a comparable scripted order when the editing process has been concluded and the film is completed. However, in a computer game, the total environment must support the gameplay. It may contain cut scenes, scripted events as well as those based on player action.
In the first case you have the same control as in traditional movies, while in the second you preplan an event to occur at a specific point in the game environment. For the last case, a
number of options are available soundwise that can be supplied to the player through a database. As an example, you can, and probably should, limit the number of weapons accessible to the player. Every single weapon should be discernible from any other weapon through its sound in order to enhance the semantic value. Our point here is that the player is supplied with a number of objects with which to play the game. Many of them produce sound effects within the diegesis of the game, such as shots. As this example shows, the IEZA-framework is useful in this part of the process, discerning what kind of sound belongs where in the game’s structure. You have some, but not total, control over when, why and where the player will use these play objects. In the above, our focus is on the sonic environment of computer games and the problem of balancing the sounds in relation to each other. However, very few games consist of sound alone (notable examples can be found on websites such as http://www.audiogames.net/). What happens, then, when a game consisting of sound and graphical elements is played? What does sound provide to this experience? In the following section we discuss a case study that relates to this issue in general and the use of sound as a means of directing the player in particular. According to Ong (1982/90), vision and hearing have a basic bipolarity. The aim of the following paragraph is to discuss the relation between vision and hearing, as well as Ong’s suggested bipolarity of these two and how this relates to immersion. Vision separates us from the environment, making the limits of our bodily containers protrude, whereas sound integrates us with the environment, blurring the border between the container of the self and the adjacent environment. Ong’s theory, which is concerned with the differences between written and spoken language, might seem odd to use in relation to a model for the analysis and production of computer game audio. 
Nevertheless, we find his remark about this bipolarity highly relevant with regard
to understanding the function that sound has in audiovisual constructions such as computer games. Think about it; in order to fully take in an environment through vision we need to move around and turn our eyes towards what we would like to see (cf. Ong, 1982/90; Gibson, 1986). Ong actually refers to the immersive effect that high-fidelity audio reproduction accentuates. Sight is limiting, but hearing is not in the same way. We can hear what is behind us and then turn around to see it. If sound integrates rather than separates us from the surrounding environment, it would seem reasonable that sound and immersion have a strong relationship. Integration through sound might lead to immersion. Hearing is, on the other hand, also a selective process. We may, to some extent, filter out uninteresting and disturbing sounds. A construed audiovisual environment is a prefiltered one into which sound and images have been put through the selective processes of their creators. We therefore discuss hearing, vision, the visual, and affordances (Gibson, 1977, 1986) in relation to the sonic environments of computer games. To some extent, the bipolarity of hearing and vision is innate. Biologically, we develop hearing before we can see. A human fetus can normally perceive sound from week 15 after conception and the ears are usually fully developed by week 24. The fetus is surrounded by amniotic fluid and this underwater environment is an immersive one that completely encloses us. We are immersed and can feel touch from week ten. In fact, one of the primary definitions of immersion used in the context of computer games clearly connects the concept of immersion with being under water (Murray, 1997, pp. 98-99). Hearing the environment precedes seeing it, in terms of how these senses develop from conception, and feeling the environment precedes hearing it. In the womb, movement is restricted and, as newborns, we have no locomotion and must be transported by others.
Sight is still limited and objects in the visual field need to be very close to be in sharp focus,
A Combined Model for the Structuring of Computer Game Audio
even if the eyes themselves are more or less fully developed from week twenty-five. However, we can hear and recognize sounds such as the voices of our parents or melodies from a computer game prior to birth and respond to such auditory stimuli by kicking and moving around. It can be asserted that sound activates the fetus. When we grow up, hearing is still physically affective and may induce both conscious and reflexive physical responses. Rapid and loud sounds may be frightening while slow and soft ones may be relaxing. When listening to the long sequence of the breathing sound in the movie 2001: A Space Odyssey (Kubrick, 1968), it is our experience that it is almost impossible not to fall into the same rhythm and breathe in synchronization with the sound of the film. The bottom line is that hearing includes us in the environment, making us part of it rather than separating us from it. Listening is part of feeling immersed and immersion is a perceptual, body-based experience. Central to human perception and cognition is the configuration of the human body and its ability to move around within an environment. This is the basis for a number of theories on embodied and situated cognition, that is, how humans make meaning of the environment in which they are situated. We propose that the organization of sound within computer game environments would benefit from some basic insight into cognitive theory and the idea of basic level primacy (Lakoff, 1987; Lakoff & Johnson, 1999). A central claim in Lakoff and Johnson’s work is that the objects we encounter may be understood from the perspective of superordinate, basic, and subordinate levels, of which the basic level is the highest one that provides an object with an overall understandable form and a general user pattern.
The concept of a chair is, for instance, at the basic level, whereas furniture is at a superordinate level and a specific red and white chair, made of steel and concrete, is at the subordinate level. The more detailed meaning we can assign to any given object the more subordinate it is. We do not have
a common user pattern for furniture, nor could the whole category be described by one simple and understandable form. For example, a table and a cupboard are both pieces of furniture but they do not look the same or have the same schemata of use. Although there are good reasons to follow Ong’s idea about the bipolarity of vision and hearing, we argue that sound might also be understood from the perspective suggested by Lakoff and Johnson. Furthermore, we constantly observe the environment through points of observation which include all our senses (Gibson, 1986). The human mind and the human body are not primarily separate units but make up complex systems of which the visual and auditory sensory systems are of great importance for our understanding of the world. The configuration of the human body has effects on human perception as well as human cognition. Gibson (1986) suggests a number of different kinds of vision which are based on the situations they are employed in:
• Snapshot vision: fixating a point and then exposing some other point momentarily
• Aperture vision: successive scanning of the visual stimuli
• Ambient vision: looking around by turning the head
• Ambulatory vision: looking around by moving towards objects.
In a real-time strategy (RTS) game like Warcraft III (Blizzard Entertainment, 2002), the player has a larger visual field than the controlled characters she is commanding, enabling an overview of the diegetic environment using 3 of the 4 different types of vision. She can use snapshot vision to fixate a point, aperture vision to perform successive scanning, and ambulatory vision by moving towards objects. If the controlled avatar is turned around, the visual field, as such, does not rotate. This is the common practice in RTS games. The player may, to some extent, change the angle of the visual field which is also somewhat limited
by a highlighted ring. What lies beyond this ring is not visible. In order to see what is hidden, the player must enforce ambulatory vision, that is, move the controlled characters. In an adventure game such as Legend of Zelda (Nintendo, 1987), the player’s view is locked in a specific angle heightwise. The player needs to move the avatar towards the end of the visual field in order to reveal what lies beyond the framing of the diegetic environment, that is, she can use snapshot vision, aperture vision and ambulatory vision. In F.E.A.R. (Monolith Productions, 2005), which is a first-person shooter (FPS), the player, on the other hand, sees the world through the eyes of the avatar. An immobile observer, who cannot change the viewpoint within the game, can only use either snapshot vision or aperture vision within the visual field of the computer game environment. A game such as Myst (Cyan Worlds, 1993) works this way, as do Pac-Man (Namco, 1980), Space Invaders (Taito, 1978) and many other early games. Hence “The single, frozen field of view provides only impoverished information about the world […] The evidence suggests that visual awareness is in fact panoramic and does in fact persist during long acts of locomotion” (Gibson, 1986, p. 2). All the games mentioned above have sound to make the environment more connected to the act of playing and to achieve a more prominent and lifelike game environment. The sonic environment of Myst is quite elaborate for its time, as the game was distributed on CD-ROM, which allowed considerably more data than earlier games. In addition, more data capacity also meant comparatively high audio resolution, that is, bit depth and sample rate. Pac-Man and Space Invaders used other kinds of technology in their original form, relying on the hardware rather than the software but, when ported to other platforms, such as consoles and PCs, they were kept quite close to the original limits soundwise.
If sound integrates us in the environment, as Ong (1982/90) proposes, and if sound and immersion are also related, we might also employ different kinds of listening to make the environment meaningful. As mentioned earlier, Chion (1994) suggests 3 different listening modes: causal, semantic and reduced listening. In addition to these, we suggest the following different kinds of listening when playing a computer game, which are analogous to Gibson’s 4 kinds of vision:
• Snapshot listening: fixating a point and then shifting to some other point momentarily by filtering out all other sound sources
• Aperture listening: successive scanning of the audio stimuli
• Ambient listening: increasing the frequency range of the sound by turning the body towards its source for higher definition of the sound
• Ambulatory listening: listening by moving around and using sound as part of the navigation within the environment.
In order to support the idea of our 4 listening modes and their relation to Gibson’s 4 modes of seeing, we briefly present a case study called In the Maze, which was a laboratory-based experiment conducted at the InGaMe Lab at the University of Skövde. This discussion also provides interesting connections to our combined model for game audio. The case study was originally devised to investigate whether or not sound can be said to align with Gibson’s (1977, 1986) ideas of affordances and, if so, whether sound stimuli would make certain locomotive patterns more probable than others.
The affordances of the environment are what it offers the animal, what it provides or furnishes, either for good or ill […] I mean by it something that refers to both the environment and the animal in a way no existing term does. It implies the complementarity of the animal and the environment. (Gibson, 1986, p. 127)
The game environment for our experiment was a hexagonal structure comprising a labyrinth of corridors that made it impossible to adopt a strategy based on always going left or always going right because that would only lead the test subject back to the starting point. We also tried to create a consistent environment, that is, a level consistent with reality and including sound that would match the visual environment. There were, however, differences in the sound played at specific parts in the corridors leading to intersections. The test subjects played the game wearing headphones. At certain points in the game’s 4 levels, the game was scripted to play 2 different kinds of sounds in the right and left headphone speakers. The player did not at first know when such a scripted sound would occur but could turn back in the corridor and the sounds would be triggered again. That is to say, the first time a scripted sound played, it was not the result of any conscious game strategy formulated by the player. The difference between the sounds was that one kind of sound was meant to have the semantic value open and the other closed, at a basic level of categorization. In other words, we tried to propose a certain universal user pattern, to walk towards the open sound rather than the closed one. The basic intention of the test was that if a sound in the right speaker was designed to suggest “closed”, the path to the right would lead to a dead end and vice versa. The only instructions the test subjects received were to play a game. By doing so, we introduced the idea that there ought to be some form of rules and a ludus element rather than free playing activity, that is, paidia (Caillois, 1958/1961). The hypothesis was that with only rudimentary instructions the test subjects would need to identify the environment as a maze by exploring it using the 4 different types of vision that Gibson suggests, then devise a strategy for moving through the maze.
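As a rough illustration only, the scripted cue logic described above could be sketched in Python as follows. This is not the script used in the experiment: the class name `CueTrigger`, the field `dead_end_side` and the method names are hypothetical, and the sketch only models two things stated in the text, namely that the "closed" sound is played on the side of the dead end and that the cue plays again when the player walks back through the corridor.

```python
from dataclasses import dataclass

OPEN, CLOSED = "open", "closed"


@dataclass
class CueTrigger:
    """A scripted point in a corridor that leads to an intersection.

    All names here are illustrative assumptions, not taken from the study.
    """
    dead_end_side: str           # "left" or "right": the path that leads nowhere
    player_inside: bool = False  # whether the player was in the zone last update

    def channel_cues(self):
        """Map each headphone channel to the sound category it should play."""
        other = "left" if self.dead_end_side == "right" else "right"
        return {self.dead_end_side: CLOSED, other: OPEN}

    def update(self, player_in_zone):
        """Return the cues to play when the player enters the zone, else None.

        The trigger re-arms once the player leaves, so turning back in the
        corridor and re-entering the zone plays the sounds again.
        """
        entered = player_in_zone and not self.player_inside
        self.player_inside = player_in_zone
        return self.channel_cues() if entered else None
```

Calling `update(True)` on a trigger with `dead_end_side="right"` yields the "closed" category for the right channel and "open" for the left; repeated calls while the player stays inside the zone return `None`.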
The idea of our 4 modes of listening was not part of the hypothesis but a result deduced from these tests. We collected several layers and types of data for this test:
• Video recordings of the gameplay session from the players’ perspective
• Video recordings of the game player from 3 different angles: face on, from the left side and from above
• Sound recording of the player while playing, for the purpose of capturing spontaneous comments addressed to the game as a system
• Video and audio recordings of semi-structured interviews after test sessions and
• Video and audio recording of a replay of each player’s session, in which they were given a chance to freely comment upon their own gameplay.
Several test subjects adopted an audio-based game strategy even if many were not actually aware of it. What we can deduce from the data collected is that many of the test subjects tried to follow sound to reach the end of each level in the labyrinth. We did put in a reversed level to examine whether it actually was the audio that mattered with regard to choices. That is, one of the 4 levels had the sounds that signaled open leading to dead ends and vice versa. This level indicates a tendency that test subjects really followed sounds and were confused when the pattern was changed. The data shows that a perceptual/cognitive set (Bugelski & Alampay, 1961; Wilhelmsson, 2001) may be constructed of audio and visual stimuli and that such a perceptual set may lead to the formation of a strategy to reach the end state of a game. We used the game engine of Half-Life (Valve Corporation, 1998) for our case study test. Some of the test subjects identified the game as a Half-Life level very quickly, which in turn, due to previous experience of Half-Life, led to immediate speculation about what the game would provide. Half-Life is an FPS game based on the principle of Agôn (Caillois, 1958/1961). Test subjects with previous and deep experience of FPS games therefore presumed they would encounter enemies of some kind and quickly adopted a specific locomotive pattern. When the game began, they rapidly turned around to get an overview, that is, ambient vision; they also tried to watch their back at times and had a tendency to walk in a criss-cross pattern, indicating ambulatory vision. At times, criss-crossing led them to see more of one corridor, depending on which side they were walking or running when they reached an intersection. That is, if they were keeping to the right, they would see more of the corridor to the left and, in some cases, they preferred to move towards what they could see. Test subjects who had a great deal of experience with Half-Life or other similar first-person shooters probably also used snapshot vision to rapidly scan the environment. However, we had no means of measuring this and the only way we can observe the use of snapshot vision from our data is how the centre of the first-person perspective fluctuates. It is probably the case that subjects moving rapidly in the environment need to use snapshot vision due to their velocity but further tests would need to be conducted before jumping to conclusions. Furthermore, inexperienced players tended to always walk in the direction suggested by their starting orientation. The ME-FIRST orientation (Lakoff & Johnson, 1980; Wilhelmsson, 2001), as well as the experience of having a body and moving primarily forwards, overshadowed other possibilities of locomotion: Since people typically function in an upright position, see and move frontward, spend most of their time performing actions, and view themselves as being basically good, we have a basis in our experience for viewing ourselves as more UP than DOWN, more FRONT than BACK, more ACTIVE than PASSIVE, more GOOD than BAD. (Lakoff & Johnson, 1980, p. 132) The Game Ego manifestation (Wilhelmsson, 2001) presented the inexperienced player with a direction for walking or running that was not initially questioned. At the same time, the affordance
of the walk-ability is at play: You walk straight on because that is what you do in a corridor. “Moving objects generally receive a FRONT–BACK orientation so that the front is in the direction of motion (or in the canonical direction of motion, so that a car backing up retains its front)” (Lakoff & Johnson, 1980, p. 42). We can also conclude that the culture of play and prior familiarity with this kind of game environment had some influence on how the test subjects tried to move the Game Ego manifestation around within the game environment and not only the visual (and sonic) affordances as such. The results of the case study are, in essence, transferable to the suggested combined model and take into account the relation between vision and sound. In fact, the method used in the case study makes a good integral part of the combined model. Some of the test subjects were affected not only by the ambulatory listening but also by the ambulatory visual position within the game environment. It is important to bear this in mind, since most digital games consist of graphics and sound manifesting some kind of environment. Games are actions undertaken by players and these actions may be induced by sound and/or by sound and graphics in combination. The horizontal axis of the original IEZA-framework is the one that categorizes the game audio in terms of setting versus activity and here we have a clear connection between our test and the combined model. The case study provides the material for the analysis of the sound and image relation and the effect of locomotion, that is, it stresses the horizontal axis of the IEZA-framework that differentiates setting and action. The vertical axis of the IEZA-framework categorizes the game audio in terms of diegetic versus non-diegetic. 
The dynamics of the audiovisual environment and Ong’s (1982/90) suggested bipolarity of vision and hearing can be understood at a deeper level using the combined model that takes into account both the cognitive load on the subject playing the game and the moving around within
Figure 6. Warcraft III analysis example. Numbers are collected from Table 2 as an imaginable snapshot of sounds. The combined model visualizes the possible cognitive load in the audio layering. The more central a sound is placed, the higher its level of encoding; the more peripheral a sound is, the lower its level of encoding
the game environment, exploring its possibilities in a dynamic flow (Figures 3, 5, 6, 7, and 8).
USING THE COMBINED MODEL TO ANALYZE COMPUTER GAMES
In the following section, we provide 3 sample analyses using the combined model. The games used for these analyses are: F.E.A.R. (Monolith Productions, 2005); Warcraft III (Blizzard Entertainment, 2002); and Legend of Zelda (Nintendo, 1987). The aim of the analyses was to find each sound source possible within these games and then estimate the properties of the sounds. Not every single sound is included as it would have taken too long to find them all. Thus, for example, sound 25 in Table 2 “Hero uses magic” represents every sound created when a Hero in the game uses some kind of magic spell. In this way, the results are applicable to the combined model and are presented both as tables (Tables 1, 2, and 3)
and as snapshots of the audio layering within the combined model (Figures 5, 6, and 7). While employing the combined model, we found that the internal emphasis on the different parts of the IEZA-framework varies between different types of games, due to limitations of technology and genre conventions, which in turn depend on the gameplay and the relationship between player and game environment. For example, a first-person shooter or a shoot’em up game will emphasize the Effect sounds, and have fewer Zone sounds. A typical example of this is F.E.A.R. (Monolith Productions, 2005). There are indeed a lot of yellow effect sounds in the game F.E.A.R. (2005). Luckily, not all of these sounds are always played simultaneously. However, sometimes, many are played together, which results in a big wall of sound that is hard to make sense of. This can be used to emphasize chaos and, in the context of an action game, it might well be useful. Nevertheless, that would be its only use. With regard to the analysis, F.E.A.R.
Figure 7. Imaginable snapshot of sounds in Legend of Zelda
also has few interface sounds, probably due to the minimalistic graphical interface. Warcraft III (2002) was chosen for analysis on the basis of the personal pre-understanding that its audio is well balanced and also complete in relation to the visual component of the game. In Figure 6, we used the sounds from Warcraft III to exemplify how to apply the combined model for the analysis of a specific game. Numbers are collected from Table 2 as a possible snapshot of sounds. The combined model visualizes the possible cognitive load in the audio layering. In addition, the model visualizes that there are 3 bass, 4 midrange, and 2 treble sounds. If there are too many sounds in the centre of the model, the cognitive load, in terms of encoded sounds, is higher. Not surprisingly, the midrange sounds are cyan (1 = Cutting wood) and violet (32 = “Our goldmine has collapsed”). Numbers 12, 17 and 18 are sound events connected to Effect and Activity
and all are diegetic sounds. Sound 21 (Crickets) belongs to the Zone and is an ambient sound with a lot of treble. Sound 32 (“Our goldmine has collapsed”) originates from the non-diegetic Affect part of the sonic environment as does sound 43 which is the background music. Our analysis of Legend of Zelda (Nintendo, 1987) was mainly due to curiosity about how earlier games differ in the distribution of audio categories in relation to IEZA and Murch’s conceptual model. The Loudness and Frequency parameters were omitted in the analysis of Legend of Zelda. Due to the technical nature of the system in which Legend of Zelda is played, the number of possible simultaneous sounds is limited to only a few. The dynamic range is also very narrow, therefore all the shapes are sparse and equally sized. The system, as such, does not support spoken language, which is why there are no encoded sounds.
Table 1. F.E.A.R. analysis
F.E.A.R. (2005)
| No. | Sound Event | State | Diegetic? | IEZA | Color | Origin | Loudness | Frequency Band |
|---|---|---|---|---|---|---|---|---|
| 1 | Weapon reload | In-game | Yes | Effect | Yellow | Character | 3 | Middle |
| 2 | Clothes sound | In-game | Yes | Effect | Yellow | Character | 2 | Middle |
| 3 | Player footsteps | In-game | Yes | Effect | Cyan | Character | 2 | Middle |
| 4 | Enemy footsteps | In-game | Yes | Effect | Cyan | Character | 2 | Middle |
| 5 | Landing after jump | In-game | Yes | Effect | Yellow | Character | 3 | Middle |
| 6 | In-game music | In-game | No | Effect | Red | “Orchestra” | 2 | Low |
| 7 | Enemy gunfire | In-game | Yes | Effect | Cyan | Character | 4 | Middle |
| 8 | Fire gun | In-game | Yes | Effect | Cyan | Character | 4 | Middle |
| 9 | Glass shatter | In-game | Yes | Zone | Yellow | Object | 2 | High |
| 10 | Empty shell bounce | In-game | Yes | Effect | Cyan | Object | 2 | High |
| 11 | Enter slow motion mode | In-game | No | Affect | Orange | “Narrator” | 2 | Low |
| 12 | Exit slow motion mode | In-game | No | Affect | Orange | “Narrator” | 2 | Low |
| 13 | Enemy radio chatter | In-game | Yes | Effect | Violet | Character | 3 | Middle |
| 14 | Friendly radio chatter | In-game | Yes | Effect | Violet | Character | 3 | Middle |
| 15 | Radio noise | In-game | Yes | Effect | Orange | Object | 2 | Middle |
| 16 | Throw grenade | In-game | Yes | Effect | Yellow | Character | 1 | Middle |
| 17 | Grenade bouncing | In-game | Yes | Effect | Cyan | Object | 2 | High |
| 18 | Grenade explosion | In-game | Yes | Effect | Yellow | Object | 5 | Low |
| 19 | Enemy talk | In-game | Yes | Effect | Violet | Character | 3 | Middle |
| 20 | Change weapon | In-game | Yes | Effect | Yellow | Object | 3 | Middle |
| 21 | Enemy dies | In-game | Yes | Effect | Violet | Character | 3 | Middle |
| 22 | Breaking environment | In-game | Yes | Zone | Yellow | Object | 2 | Middle |
| 23 | Pause game | In-game | No | Interface | Orange | “Narrator” | 1 | High |
| 24 | Unpause game | In-game | No | Interface | Orange | “Narrator” | 1 | High |
| 25 | Ghost talking | In-game | Yes | Effect | Violet | Character | 2 | Middle |
| 26 | Picking up weapon | In-game | Yes | Effect | Yellow | Object | 2 | Middle |
| 27 | Picking up grenade | In-game | Yes | Effect | Yellow | Object | 2 | Middle |
| 29 | Throw weapon | In-game | Yes | Effect | Yellow | Object | 3 | Middle |
| 30 | Pick up health booster | In-game | No | Interface | Orange | “Narrator” | 3 | Middle |
| 31 | Pick up reflex booster | In-game | No | Interface | Orange | “Narrator” | 3 | Middle |
| 32 | Pick up medkit | In-game | Yes | Effect | Yellow | Object | 2 | Middle |
| 33 | Using medkit | In-game | Yes | Effect | Orange | Object | 2 | Middle |
| 28 | Menu music | Menu | No | Affect | Red | “Orchestra” | 2 | Low |
| 34 | Menu selection | Menu | No | Interface | Orange | “Narrator” | 1 | High |
| 35 | Menu accept | Menu | No | Interface | Orange | “Narrator” | 1 | High |
| 36 | Menu go back | Menu | No | Interface | Orange | “Narrator” | 1 | High |
Table 2. Warcraft III analysis
Warcraft III
| No. | Sound Event | State | Diegetic? | IEZA | Color | Origin | Loudness | Frequency Band |
|---|---|---|---|---|---|---|---|---|
| 1 | Cutting wood | In-game | Yes | Effect | Cyan | Character | 2 | Middle |
| 2 | “I can’t build there” | In-game | Yes | Effect | Violet | Character | 3 | Middle |
| 3 | Insufficient resources | In-game | Yes | Affect | Violet | “Narrator” | 3 | Middle |
| 4 | “Awaiting order” | In-game | Yes | Effect | Violet | Character | 3 | Middle |
| 5 | “Job’s done” | In-game | Yes | Effect | Violet | Character | 3 | Middle |
| 7 | “Accepting order” | In-game | Yes | Effect | Violet | Character | 3 | Middle |
| 8 | New Unit Available | In-game | Yes | Effect | Violet | Character | 3 | Middle |
| 9 | Unit attack order | In-game | Yes | Effect | Violet | Character | 3 | Middle |
| 10 | Click on building | In-game | Yes | Effect | Yellow | Object | 2 | Low/Middle |
| 11 | Building construction | In-game | Yes | Effect | Yellow | Object | 2 | Low/Middle |
| 12 | Goldmine collapse | In-game | Yes | Effect | Yellow | Object | 4 | Low/Middle |
| 13 | Building collapse | In-game | Yes | Effect | Yellow | Object | 4 | Low/Middle |
| 14 | Building on fire | In-game | Yes | Effect | Yellow | Object | 2 | Middle |
| 15 | Building attacked | In-game | Yes | Effect | Yellow | Object | 2 | Middle |
| 16 | Click on “critter” | In-game | Yes | Effect | Yellow | Character | 1 | Middle |
| 17 | Unit Attacked | In-game | Yes | Effect | Yellow | Character | 2 | Low |
| 18 | Falling tree | In-game | Yes | Effect | Yellow | Object | 4 | Low |
| 19 | Singing birds | In-game | Yes | Zone | Yellow | Ambience | 1 | High |
| 20 | Ambient noise | In-game | Yes | Zone | Yellow | Ambience | 1 | High |
| 21 | Crickets | In-game | Yes | Zone | Yellow | Ambience | 1 | High |
| 22 | Frog | In-game | Yes | Zone | Yellow | Ambience | 1 | Middle |
| 23 | Cicadas | In-game | Yes | Zone | Yellow | Ambience | 1 | High |
| 24 | Owl | In-game | Yes | Zone | Yellow | Ambience | 1 | Middle |
| 25 | Hero uses magic | In-game | Yes | Effect | Orange | Character | 2 | All |
| 26 | Unit constructing | In-game | Yes | Effect | Cyan | Character | 2 | Short |
| 27 | Unit dies | In-game | Yes | Effect | Cyan | Character | 3 | Middle |
| 28 | “Victory” | In-game | No | Affect | Violet | “Narrator” | 4 | Low/Middle |
| 29 | “Defeat” | In-game | No | Affect | Violet | “Narrator” | 4 | Low/Middle |
| 30 | “Research complete” | In-game | No | Affect | Violet | “Narrator” | 3 | Middle |
| 31 | “Upgrade complete” | In-game | No | Affect | Violet | “Narrator” | 3 | Middle |
| 32 | “Our goldmine has collapsed” | In-game | No | Affect | Violet | “Narrator” | 3 | Middle |
| 33 | “Our hero has fallen” | In-game | No | Affect | Violet | “Narrator” | 3 | Middle |
| 34 | “Our forces are under attack” | In-game | No | Affect | Violet | “Narrator” | 3 | Middle |
| 35 | Rooster | In-game | No | Affect | Yellow | “Narrator” | 2 | Middle |
| 36 | Wolf Howl | In-game | No | Affect | Yellow | “Narrator” | 2 | Low |
| 37 | Set rally point | In-game | No | Affect | Yellow | “Narrator” | 2 | Low |
| 38 | Set building spot | In-game | No | Affect | Yellow | “Narrator” | 2 | Low |
| 39 | Unavailable Sound | In-game | No | Interface | Yellow | “Narrator” | 3 | Middle |
| 40 | Click upper GUI | In-game | No | Interface | Orange | “Narrator” | 1 | Middle |
| 41 | Mini map signal | In-game | No | Affect | Orange | “Narrator” | 1 | High |
| 42 | Click lower GUI | In-game | No | Interface | Orange | “Narrator” | 1 | Middle |
| 43 | Background Music | In-game | No | Affect | Red | “Orchestra” | 2 | Low/Middle |
| 44 | Meteor Falling | Menu | Yes | Effect | Yellow | Object | 2 | High |
| 45 | Meteor Impact | Menu | Yes | Effect | Yellow | Object | 3 | Low |
| 46 | Rain | Menu | Yes | Zone | Yellow | Ambience | 1 | High |
| 47 | Thunder | Menu | Yes | Zone | Yellow | Ambience | 2 | Low |
| 48 | Click Menu | Menu | No | Interface | Orange | “Narrator” | 2 | Middle |
| 49 | Menu switch | Menu | No | Interface | Orange | “Narrator” | 2 | High |
| 50 | Menu Music | Menu | No | Affect | Red | “Orchestra” | 2 | Low/Middle |
The 3 sample analyses clearly show that the emphasis on specific parts of the IEZA-framework shifts depending on what genre and what kind of platform the game belongs to. The snapshots made with the combined model show how sounds are clustered and provide a visualization of the sound layering. Figure 5 illustrates how the sounds of F.E.A.R. are mostly only within one quarter of the model: the diegetic activity quarter. A highly paced game, such as this, would probably have most of its sound in this quarter. Figure 6 shows that Warcraft III has sounds in all 4 quarters, with an emphasis on the diegetic activity quarter, while Figure 7 is an example of an older kind of game in which the technical limitations affect how the sonic environment is structured.
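To make the notion of quarters concrete, the following Python sketch tallies sounds per quarter of the IEZA-framework. It is an illustration under stated assumptions, not part of the analyses: the mapping `IEZA_AXES` assumes the usual reading of the framework (Effect and Zone as diegetic, Interface and Affect as non-diegetic; Effect and Interface as activity, Zone and Affect as setting), the function name `quarter_counts` is hypothetical, and `SAMPLE` holds only a handful of rows copied from Table 2.

```python
from collections import Counter

# IEZA category -> (diegetic axis, setting/activity axis).
# Assumed reading of the two axes described in the text.
IEZA_AXES = {
    "Interface": ("non-diegetic", "activity"),
    "Effect": ("diegetic", "activity"),
    "Zone": ("diegetic", "setting"),
    "Affect": ("non-diegetic", "setting"),
}

# A few rows from Table 2 (Warcraft III): (number, sound event, IEZA category)
SAMPLE = [
    (1, "Cutting wood", "Effect"),
    (12, "Goldmine collapse", "Effect"),
    (21, "Crickets", "Zone"),
    (32, "Our goldmine has collapsed", "Affect"),
    (42, "Click lower GUI", "Interface"),
    (43, "Background Music", "Affect"),
]


def quarter_counts(rows):
    """Count how many sounds fall into each quarter of the model."""
    return Counter(IEZA_AXES[ieza] for _number, _event, ieza in rows)


counts = quarter_counts(SAMPLE)
for quarter, n in sorted(counts.items()):
    print(quarter, n)
```

Run over a complete table rather than this small sample, such a tally would give a numerical counterpart to the clustering that the snapshots show visually.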
USING THE COMBINED MODEL AS A PRODUCTION TOOLSET
We have declared that the potential loss of control over the sonic environment while producing a computer game is a problem. How can the combined model solve this?
The combined model above (Figure 3) allows the visualization of the sonic environment of a computer game, in terms of cognitive load (Figures 5, 6, 7 and 8). While producing a sonic environment for a game, the different sounds to be used are first categorized, in accordance with the IEZA-framework, and then placed into Murch’s model, as more or less encoded or embodied, which in combination results in our proposed model (Figures 3 to 8). The combined model visualizes the sonic environment of a given environment in a way that makes it possible to see how sound might be clustered; the closer the sounds are to each other the fewer that can be used if clarity of meaning is wanted, that is, a good level of semantic value. The effect will be that the sound designer can see beforehand whether the sonic environment will be biased towards encoded or embodied sound, providing the opportunity to rebalance accordingly. This also balances the frequency spectrum of the sonic environment and distributes the cognitive load in the brain. Even if all possible combinations of sounds cannot be plotted, a sound strategy for limiting these unwanted effects can be adopted by plotting prototypical game events
Table 3. Legend of Zelda analysis
Legend of Zelda
| No. | Sound | State | Diegetic? | IEZA | Color | Origin |
|---|---|---|---|---|---|---|
| 1 | Enter cave/walking stairs | In-game | Yes | Effect | Cyan | Character |
| 2 | Use sword | In-game | Yes | Effect | Yellow | Character |
| 3 | Sword shoots | In-game | Yes | Effect | Yellow | Character |
| 4 | Enemy takes damage | In-game | Yes | Effect | Yellow | Character |
| 5 | Enemy dies | In-game | Yes | Effect | Yellow | Character |
| 6 | Open locked door | In-game | Yes | Effect | Cyan | Object |
| 7 | Door shuts | In-game | Yes | Effect | Cyan | Object |
| 8 | Boomerang | In-game | Yes | Effect | Cyan | Object |
| 9 | Boss sound | In-game | Yes | Effect | Yellow | Character |
| 10 | Sword useless | In-game | Yes | Effect | Yellow | Character |
| 11 | Place bomb | In-game | Yes | Effect | Yellow | Object |
| 12 | Bomb explode | In-game | Yes | Effect | Yellow | Object |
| 13 | Waves against shoreline | In-game | Yes | Zone | Yellow | Ambience |
| 14 | Background music | In-game | No | Affect | Red | “Orchestra” |
| 15 | Letters typing | In-game | No | Effect | Cyan | “Narrator” |
| 16 | Pick up consumable | In-game | No | Effect | Orange | “Narrator” |
| 17 | Pick up quest item | In-game | No | Effect | Orange | “Narrator” |
| 18 | Consumable appear | In-game | No | Effect | Orange | “Narrator” |
| 19 | Collect money | In-game | No | Effect | Orange | “Narrator” |
| 20 | Dungeon music | In-game | No | Affect | Red | “Orchestra” |
| 21 | Key appear | In-game | No | Affect | Orange | “Narrator” |
| 22 | Collect key | In-game | No | Effect | Orange | “Narrator” |
| 23 | Solve a puzzle | In-game | No | Affect | Orange | “Narrator” |
| 24 | Take compass | In-game | No | Effect | Orange | “Narrator” |
| 25 | Take Map | In-game | No | Effect | Orange | “Narrator” |
| 26 | Low health | In-game | No | Affect | Orange | “Narrator” |
| 27 | Player character “dies” | In-game | No | Effect | Orange | “Narrator” |
| 28 | Switch item | In-game | No | Interface | Orange | “Narrator” |
| 29 | Dungeon Complete Music | In-game | No | Affect | Red | “Orchestra” |
| 30 | Menu music | Menu | No | Affect | Red | “Orchestra” |
| 31 | Menu selection | Menu | No | Interface | Orange | “Narrator” |
| 32 | Chose letter | Menu | No | Interface | Orange | “Narrator” |
| 33 | Game over music | Menu | No | Affect | Red | “Orchestra” |
in accordance with the central gameplay aspects of a given game and a given game genre. One can also use the different sizes of the primitives, put into the combined model (Figure 8), to show
the dynamic range, that is, the relative loudness between sounds. It can at first be difficult to see how to utilize the model practically. The model’s present design may well be refined later, but that
Figure 8. Shoot the Ducks level 1
does not really matter. This is not just a model but also a kind of paradigm or, in other words, a way of thinking about these matters. The key is to be pro-active with regard to sound design and to plan the distribution of sounds before they are even created. In most audio-editing software, the colors of Murch’s original conceptual-model, and our combined model (Figures 2 and 3), may well be used to designate the status of the sound files as more or less encoded. The music track (affect) could be made red, guns and explosions (effect) would be yellow, the ambient sounds, such as birds and so on, (zone) should be orange, and the dialogue (encoded) shall be blue, which is in accordance with Murch’s (1998) conceptual model. This feature of the color encoding of specific sound events is found in many commercial products and may very well be used in this manner while creating a sonic environment for a game or movie
in order to avoid cognitive overload. In fact, the combined model might, in itself, be used as an interface for audio editing software. What then are the benefits of using our combined model in practice? Let us provide an example of creating the sound design for a simple game. We first present the game’s design document. The aim here is not to create a stunning new best-selling game, but rather to exemplify how the combined model may be used to plan the sound of the proposed shoot’em up game on the basis of a design document.
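The color designation of sound files mentioned above can be sketched in a few lines of Python. This is only a planning aid under stated assumptions: the colors follow the assignment given in the text (music red, effects yellow, ambience orange, dialogue blue), whereas the category keys, the function name `crowded_colors`, the threshold `max_per_color` and the example snapshot are all hypothetical and would need tuning for a real game.

```python
from collections import Counter

# Colors as designated in the text; the category keys are shorthand
# introduced for this sketch.
CATEGORY_COLOR = {
    "music": "red",
    "effect": "yellow",
    "ambience": "orange",
    "dialogue": "blue",
}


def crowded_colors(snapshot, max_per_color=2):
    """Return the colors that hold more simultaneous sounds than allowed.

    snapshot: list of (sound name, category) pairs that may play together.
    max_per_color: hypothetical threshold, not a value from the chapter.
    """
    counts = Counter(CATEGORY_COLOR[category] for _name, category in snapshot)
    return {color: n for color, n in counts.items() if n > max_per_color}


# Hypothetical snapshot of one moment in a shoot'em up
snapshot = [
    ("level music", "music"),
    ("gun #1 fired", "effect"),
    ("duck hit", "effect"),
    ("wall hit", "effect"),
    ("pond ambience", "ambience"),
]
print(crowded_colors(snapshot))
```

With this snapshot, only the yellow group exceeds the assumed threshold, which would suggest thinning out or staggering the effect sounds for that game event before they are even produced.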
Shoot the Ducks Design Document
Game Objects
The game includes twenty objects: 4 ducks, armor for the ducks for each of the levels from 5 to 8, 2 guns, a pond, and a wall. The wall object
Table 4. The sounds from the Shoot the Ducks design document
A sound that is played as background music for instructions
A sound that is played to indicate game started
A sound that is played as background music for level #1
A sound that is played as background music for level #2
A sound that is played as background music for level #3
A sound that is played as background music for level #4
A sound that is played as background music for level #5
A sound that is played as background music for level #6
A sound that is played as background music for level #7
A sound that is played as background music for level #8
A duck sound that is played while the ducks are swimming
A duck chatter sound
A duck sound that is played when the duck is hit by a shot
A duck sound that is played when a duck dies
A bounce sound that is used when the ducks hit a wall object
A gun handling sound (gun click) that indicates the change from one gun to another
A sound that is played when gun #1 is fired
A sound that is played when gun #2 is fired
A sound that is played when a wall is hit by a shot
A sound that is played as background music for the end titles
A sound that is played to indicate that the player has reached high score
A sound that is looped that contains ambience sound
A sound that is played to indicate a score change
A sound that is played to indicate a change of level
has a grass-like image, and the playing area, the pond, is surrounded by the grass-like objects. From levels 4 to 8, the pond has small islands of wall objects behind which the ducks can seek shelter (they are programmed to do so). The ducks swim in the pond, which is made of water-like images. The only function of the wall object is to stop the ducks from moving out of the pond. The duck object, which has the image of a duck, moves with varying speed. Whenever it hits a wall object it bounces. The player can shoot through the walls on levels 4 to 8, but the effect of the shots decreases. Whenever the player hits a duck the score increases by 10 points. The duck’s speed increases slightly when it jumps to a random place. The gun object is placed at the bottom of the screen, where it is fixed and can only move to the left and right in a half-circular pattern by keyboard commands (see Controls).
Sounds
We use 24 sounds in this game, covering all the categories from the IEZA-framework as well as the span from encoded to embodied sounds (see Table 4). In addition, the sounds are spread across the action versus setting axis and the diegetic versus non-diegetic axis.
Controls
Both mouse and keyboard control the game. The mouse must have left and right buttons and a scroll wheel. The gun is aimed with the A and D keys on the keyboard: A moves the gun barrel towards the left and D moves it towards the right. The gun is fired with a click of the left mouse button, and scrolling the mouse wheel changes from one gun to the other.
Game Flow
The game starts with the instructions, during which background music #1 plays, and play begins when the player presses the on-screen start button. This shows the room with the swimming ducks. The game ends when the player presses the <Esc> key.
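The game flow above amounts to a small state machine, which can be sketched as follows. This is an illustration only: the state names, the event strings `"start_button"` and `"esc"`, and the choice to let <Esc> end the game from the instructions screen as well are assumptions, since the design document does not specify them.

```python
from enum import Enum


class State(Enum):
    INSTRUCTIONS = "instructions"  # background music #1 plays here
    PLAYING = "playing"            # the room with the swimming ducks
    ENDED = "ended"


def next_state(state, event):
    """Return the state that follows `state` when `event` occurs.

    Event names are hypothetical: "start_button" for the on-screen
    start button and "esc" for the <Esc> key.
    """
    if state is State.ENDED:
        return State.ENDED          # nothing happens after the game ends
    if event == "esc":
        return State.ENDED          # assumed to work from any screen
    if state is State.INSTRUCTIONS and event == "start_button":
        return State.PLAYING
    return state                    # all other events leave the state as is


if __name__ == "__main__":
    state = State.INSTRUCTIONS
    for event in ("start_button", "esc"):
        state = next_state(state, event)
        print(event, "->", state.value)
```

Listing the states this way also shows where sounds attach to the flow: the instruction music belongs to the first state, and the "game started" sound to the transition out of it.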
Scores
At the start of the game the score is set at 0. The number of hits a duck can take before dying depends on the following factors:
• Distance from the gun
• Angle of the shot
• Area hit by the shot
• Whether the shot has passed a grass island
• The armor value of the duck
Levels
The game has 8 levels. The difficulty of the game increases because the initial speed of the ducks increases after each level and they are given armor from levels 4 to 8. The pond also has small islands of grass behind which the ducks can seek shelter from levels 4 to 8. The game ends when the player has killed all the ducks on all the levels or when the player presses <Esc>.
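The level rules above can be written down as a small configuration table. The sketch below is hypothetical: `BASE_SPEED` and `SPEED_STEP` are placeholder values, because the design document only says that the initial speed increases after each level, and the armor range follows this Levels section (levels 4 to 8) rather than the range given under Game Objects.

```python
from dataclasses import dataclass

NUM_LEVELS = 8
FIRST_SHELTER_LEVEL = 4   # armor and grass islands appear from level 4

BASE_SPEED = 1.0          # hypothetical initial duck speed on level 1
SPEED_STEP = 0.25         # hypothetical speed increase per level


@dataclass(frozen=True)
class Level:
    number: int
    initial_duck_speed: float
    ducks_have_armor: bool
    has_islands: bool


def build_levels():
    """Build the eight level descriptions from the rules in the text."""
    levels = []
    for number in range(1, NUM_LEVELS + 1):
        late = number >= FIRST_SHELTER_LEVEL
        speed = BASE_SPEED + SPEED_STEP * (number - 1)
        levels.append(Level(number, speed, late, late))
    return levels


if __name__ == "__main__":
    for level in build_levels():
        print(level)
```

With the levels laid out like this, the per-level background music of Table 4 can be attached to each entry by its level number.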
Utilizing the Combined Model in a Game Design Document
The relationship between game design and sound design obviously depends on the complexity of the game. In this case we mainly focused on how our model can be implemented in the sound design process. We first categorized each sound and determined its position in our combined model (Table 5). Categorizing the sounds in relation to our combined model provides us with a sense of how the sonic environment will be balanced. For the sake of consistency, we chose to use a table that is similar to the analysis examples mentioned earlier (Tables 1, 2 and 3). Furthermore, in order to keep it simple, we clustered all the similar sounds into groups; for example, the music sounds do not have to be specified as single sound events, because they will not be played simultaneously under any circumstances if the game runs as intended. Instead of 10 music sounds we only added one. If we group similar sound events, we do not have to think about sound variations at this stage. We also used this kind of sound grouping in the analyses mentioned earlier in this chapter. However, you might then ask why the Warcraft III analysis (Table 2 and Figure 6) has a lot of quotations in its table whereas the F.E.A.R. table (Table 1 and Figure 5) does not. This is simply because all the quotations from Warcraft III originate from essential sound events that are important for the gameplay. In the F.E.A.R. analysis, we chose to group the sounds of characters because they have sufficient similarities to constitute a single group of sounds. By carefully planning the game audio with two of Chion’s suggested listening modes in mind (causal listening and semantic listening), the sound designer can group the different sounds in relation to their cause and meaning.
Furthermore, if the sound designer emphasizes the basic level of categorization, the game audio, as such, will suggest what the player is supposed to do or provide feedback about what the player has done.
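The clustering of similar sounds described above can be illustrated with a short sketch. The labels below paraphrase the 24 sounds of Table 4, and the "group: detail" naming scheme and the function `cluster` are assumptions made for this example, not the authors' own identifiers.

```python
# Short labels for the 24 sounds of Table 4. The labels and the
# "group: detail" convention are illustrative assumptions.
SOUNDS = (
    ["music: instructions", "game started"]
    + [f"music: level {n}" for n in range(1, 9)]
    + [
        "duck swimming", "duck chatter", "duck hit", "duck dies",
        "duck bounce", "change weapon",
        "gun fired: gun 1", "gun fired: gun 2",
        "bullet hits wall", "music: end titles", "high score",
        "ambience", "score change", "level change",
    ]
)


def cluster(labels):
    """Group labels by the part before the colon, keeping first-seen order."""
    groups = {}
    for label in labels:
        group = label.split(":", 1)[0].strip()
        groups.setdefault(group, []).append(label)
    return groups


if __name__ == "__main__":
    groups = cluster(SOUNDS)
    print(len(SOUNDS), "sounds in", len(groups), "groups")
    print(len(groups["music"]), "music sounds planned as one group")
```

Under this labeling the ten music sounds collapse into one group and the two gun shots into another, so the 24 sounds become 14 groups. Table 5 lists 13 sound events; the high-score sound does not appear there.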
In the following review of a few sound events, we explain the process using our combined model as a production toolset. We begin by looking at quite a tricky sound event: the sound emitted when a duck is swimming. To simplify matters, we quickly decided to use a swimming sound, some kind of movement through water. It is first necessary to determine if a sound for this event would take place in the gameworld. In other words, should it be considered diegetic or non-diegetic? Indeed, it must be diegetic since the ducks live and move within the game’s environment of which the water is very much a part. After deciding that the sound is diegetic, we looked at where it belongs in the IEZA-framework. With regard to the IEZA-framework, we have two options: A diegetic sound can either be a zone or an effect sound. This is where it becomes tricky if we do not pause for a second. The outcome depends on whether the sound is emitted due to player-induced activity or is an integral part of the game’s setting. It should be remembered that in the IEZA-framework, a sound, in order to qualify as being the result of activity, has to be directly or indirectly triggered by the actions of the player. Since this sound event is not triggered by the player, either directly or indirectly, we categorized it as a zone sound. The purpose of the swimming sound is to enhance the environment, as well as to sustain the presence of water in the pond and the motion of the ducks. Our last step was to determine the position of the sound in Murch’s conceptual model. The sound has a kind of rhythmic effect which neatly suits the description of the color cyan. The sound of swimming can also have the semantic value of movement, which is the category of cyan sounds. The music sound was the next to be categorized. The design document did not mention any in-game orchestra that plays music and we therefore categorized the sound as non-diegetic.
This gave us two options: Did the sound depend on activity or setting (in other words, was it an interface or an affect sound)? We chose affect. Referring to Murch’s (1998) conceptual model, music sounds are red and embodied. We next categorized an interesting sound event, one which allowed us to make an active decision to minimize the cognitive load. According to Murch, too much dialogue or, more specifically, too much spoken language, which is encoded, will make the sonic environment dense, and avoiding an excess of spoken language means keeping the sound design clear. In this trivial example of just a few sounds, cognitive overload is obviously unlikely, but it is nevertheless good practice to consider it. The sound for a level change could utilize some sort of announcer using a voice to inform the player of the level change. If we, for example, choose to use an encouraging musical effect instead, the result would be quite different. A short encouraging musical effect is non-diegetic, so the sound was either affect or interface. Since it belonged to the game system, it was an interface sound and was given the color orange, which is the category for musical effects and embodied sounds. This game had no encoded sounds (violet) because we avoided including dialogue and spoken language (see Table 5). As the snapshot (Figure 8) of the combined model illustrates, the sounds at this level of the game are spread evenly across the dynamic range and the frequency range of the sonic environment. This example shows how the combined model may function as a production toolset. A sound designer can, of course, make a quick pencil drawing based on the game design document. A computer is not needed to begin planning the game audio. Making the model easy to draw with pencil and paper is beneficial for the sound designer. She can be part of the production process as soon as there is a design document or, in fact, in the initial discussions before a design document exists. The purpose of the combined model is to be a rapid but, at the same time, very structured way of planning the sonic environment of a game.
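The two questions used in the walkthrough above (is the sound diegetic, and does it result from activity or belong to the setting?) can be sketched as a small decision function. This is a hedged illustration of the reasoning in the text rather than an implementation supplied with the IEZA-framework; the function name `ieza_category` and its boolean parameters are hypothetical.

```python
def ieza_category(diegetic, activity):
    """Map the two questions from the text to an IEZA category.

    diegetic: True if the sound takes place in the gameworld.
    activity: True if the sound results from activity (for diegetic
              sounds, triggered directly or indirectly by the player),
              False if it belongs to the setting.
    """
    if diegetic:
        return "Effect" if activity else "Zone"
    return "Interface" if activity else "Affect"


if __name__ == "__main__":
    # The three sound events discussed in the text.
    examples = {
        "Duck swimming": (True, False),
        "Music": (False, False),
        "Level change": (False, True),
    }
    for name, (diegetic, activity) in examples.items():
        print(name, "->", ieza_category(diegetic, activity))
```

The color from Murch's conceptual model is left out on purpose: in the text it is chosen by judging how encoded or embodied a sound is, which does not reduce to a yes or no question.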
Working with the model is meant to be an easy process that leads to thinking about sound in a diversified way, providing density and clarity, avoiding a logjam of sound and unwanted sonic artifacts, and yielding a clear-cut visualization that is communicable to other members of a game development team. We have deliberately tried to minimize the terminology within the combined model in order to make it comprehensible for team members other than the sound designer.
Table 5. Shoot the Ducks sounds categorized
Sound Event | Comment | Diegetic? | IEZA | Color | Loudness | Frequency Band
Game started | Beep | No | Interface | Orange | 3 | Middle
Music | Percussion | No | Affect | Red | 2 | Low
Duck swimming | Swimming sound | Yes | Zone | Cyan | 1 | Middle
Duck chatter | | Yes | Effect | Cyan | 2 | Middle
Duck hit | Duck “scream” | Yes | Effect | Yellow | 3 | Middle
Duck dies | | Yes | Effect | Yellow | 4 | Middle
Duck bounce off wall | Cartoony bounce | Yes | Zone | Orange | 2 | Low
Change weapon | Click | Yes | Effect | Yellow | 2 | Low
Gun fired | | Yes | Effect | Yellow | 5 | Low
Bullet through wall | | Yes | Effect | Yellow | 2 | Middle
Ambience | Tree whisper and birds | Yes | Zone | Yellow | 1 | High
Score change | Beep | No | Interface | Orange | 3 | High
Level change | Short musical effect | No | Interface | Orange | 3 | Middle
As Table 5 above indicates, there is an emphasis on effect sounds in the game, which is due to the shooter genre as such. A shooter needs effect sounds as the main audio feedback for the player. After all, as a shooter you are supposed to shoot things. You would also expect sounds from weapons as well as those designating hits or missed shots, the handling of different weapons and maybe some big explosions. The internal order of the IEZA-framework would rather be EZIA in this example of Shoot the Ducks.
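The emphasis on effect sounds noted for Table 5 can be checked with a simple tally. The sketch below is an assumption-laden illustration: the rows are read from Table 5 as reconstructed from the flattened layout (in particular, the assignment of values to the Duck chatter, Duck hit, and Duck dies rows), and the names `TABLE_5`, `category_counts`, and `loudest` are hypothetical.

```python
from collections import Counter

# (sound event, IEZA category, loudness) as read from Table 5.
TABLE_5 = [
    ("Game started", "Interface", 3),
    ("Music", "Affect", 2),
    ("Duck swimming", "Zone", 1),
    ("Duck chatter", "Effect", 2),
    ("Duck hit", "Effect", 3),
    ("Duck dies", "Effect", 4),
    ("Duck bounce off wall", "Zone", 2),
    ("Change weapon", "Effect", 2),
    ("Gun fired", "Effect", 5),
    ("Bullet through wall", "Effect", 2),
    ("Ambience", "Zone", 1),
    ("Score change", "Interface", 3),
    ("Level change", "Interface", 3),
]


def category_counts(rows):
    """Count how many sound events fall into each IEZA category."""
    return Counter(category for _, category, _ in rows)


def loudest(rows):
    """Return the sound event with the highest value in the Loudness column."""
    return max(rows, key=lambda row: row[2])[0]


if __name__ == "__main__":
    for category, count in category_counts(TABLE_5).items():
        print(category, count)
    print("Loudest event:", loudest(TABLE_5))
```

On these rows the tally gives six Effect sounds, three Zone, three Interface, and one Affect, which agrees with Effect leading and Affect trailing in the EZIA ordering. Zone and Interface are tied on count alone, so the tally does not by itself decide their order.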
CONCLUSION
As this chapter has shown, audio in computer games is a complex matter, the understanding of which could be made easier using the suggested combined model for game audio. A summary of the problems addressed in this chapter and our solutions to these problems follows.
• The general lack of functional models for analyzing computer game audio
In order to solve the first problem, we have provided a functional model for the analysis of existing sonic environments in computer games and movies. The combined model covers the dynamic range of the sounds in relation to each other and the frequency range occupied by the different sounds. Encoded sounds, primarily speech, have a natural position in the human frequency response curve. Our work has been anchored in theories spanning from linguistics to semiotics and from film theory to theories of cognition. We have found that different genres of games have different emphases on which types of sound dominate the sonic environment. This might be caused by the choice of technology or by the genre as such.
A shooter needs more effect sounds than a role-playing game, for example. We have briefly presented a case study, In The Maze, to show how sound might be part of a game-playing strategy, although the results of the case study also support the idea that prior experience and anticipation concerning the game’s content affect the locomotive patterns in a game, as does the visual field. Ong’s statement that sight separates us whereas sound integrates us with the environment has been discussed and put in relation to the models for sound suggested by Huiberts and van Tol as well as by Murch.
• The general lack of functional models for the production of game audio
In this chapter, we have attempted to put forth a model for the production of the sonic environments of computer games. We have shown how the sonic environment of computer games (and movies) may be planned to avoid cognitive overload as well as unwanted interference, by using a model that combines Huiberts and van Tol’s (2008) IEZA-framework for computer game audio and Murch’s (1998) conceptual model for film sound. The loss of control a sound designer has over the playback of the audio in the gameplay of a complex game may lead to a chaotic blur of sounds causing them to lose their definition and thereby their semantic value.
• When two or more sounds are played simultaneously, the clarity of the mix depends on the type of sounds, which leads to
• The nature of the relationship between encoded and embodied sounds
The sound designer has some, though limited, control of the sonic environment in a game. To avoid a blurred sonic environment, it will be necessary to define the sound as much as possible. The combined model gives the sound designer an overview of the sonic environment that structures
the sound to avoid cognitive overload, supports density and clarity, and diversifies the sounds across the four basic categories of Interface, Effect, Zone, and Affect sounds, as well as along the setting versus activity axis. It also allows the sound designer to distinguish between diegetic and non-diegetic sounds as well as embodied versus encoded ones. The structure of the combined model provides an overview that enables the clustering of encoded and embodied sounds to be visualized in order to help the sound designer plan the production. Furthermore, the combined model establishes a common ground of terminology that is communicable in a dialogue between the sound designer, the game designer, the game writer, the graphical artist and the programmer. The combined model will need further refinement and might have the potential to function as an interface for sound design software. However, even a small step for a sound designer, such as this, might serve as a good starting point for how to plan and analyze the sonic environments of computer games.
REFERENCES
Bordwell, D., & Thompson, K. (1994). Film history: An introduction. New York: McGraw-Hill.
Bordwell, D., & Thompson, K. (2001). Film art: An introduction. New York: McGraw-Hill.
Bugelski, B. R., & Alampay, D. A. (1961). The role of frequency in developing perceptual sets. Canadian Journal of Psychology, 15(4), 201–211. doi:10.1037/h0083443
Cancellaro, J. (2006). Exploring sound design for interactive media. Clifton Park, NY: Thomson Delmar Learning.
Childs, G. W. (2007). Creating music and sound for games. Boston, MA: Thomson Course Technology.
Chion, M. (1994). Audio-vision: Sound on screen. New York: Columbia University Press.
Coppola, F. F. (Director). (1979). Apocalypse now! [Motion picture]. Hollywood, CA: Paramount Pictures.
Cunningham, S., Grout, V., & Picking, R. (2011). Emotion, content and context in sound and music. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Droumeva, M. (2011). An acoustic communication framework for game sound – Fidelity, verisimilitude, ecology. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Ekman, I. (2008). Comment on the IEZA: A framework for game audio. Gamasutra. Retrieved January 13, 2010, from http://www.gamasutra.com/view/feature/3509/ieza_a_framework_for_game_audio.php
Farnell, A. (2011). Behaviour, structure and causality in procedural audio. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
F.E.A.R. (2005). Vivendi Universal Games. Monolith Productions.
Gibson, J. (1977). The theory of affordances. In Shaw, R. E., & Bransford, J. (Eds.), Perceiving, acting and knowing (pp. ##-##). New Jersey: LEA.
Gibson, J. (1986). The ecological approach to visual perception. New Jersey: LEA.
Howard, D. M., & Angus, J. (1996). Acoustics and psychoacoustics. Oxford: Focal Press.
Hug, D. (2011). New wine in new skins: Sketching the future of game sound design. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Huiberts, S., & van Tol, R. (2008). IEZA: A framework for game audio. Gamasutra. Retrieved October 13, 2008, from http://www.gamasutra.com/view/feature/3509/ieza_a_framework_for_game_audio.php
Jørgensen, K. (2011). Time for new terminology? Diegetic and non-diegetic sounds in computer games revisited. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Kubelka, P. (1998). Talk on Unsere Afrika Reise. Presented at The School of Sound, London, England.
Kubrick, S. (Director). (1968). 2001: A space odyssey [Motion picture]. Location: Metro-Goldwyn-Mayer.
Lakoff, G. (1987). Women, fire and dangerous things. Chicago: University of Chicago Press.
Lakoff, G., & Johnson, M. (1980). Metaphors we live by. Chicago: University of Chicago Press.
Lakoff, G., & Johnson, M. (1999). Philosophy in the flesh. New York: Basic Books.
Legend of Zelda. (1987). Nintendo.
Loftus, G. R., & Loftus, E. F. (1983). Mind at play. New York: Basic Books.
Marks, A. (2001). The complete guide to game audio. Location: CMP.
Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Originally published in The Psychological Review (1956), 63, 81-97. (Reproduced, with the author’s permission, by Stephen Malinowski). Retrieved March 10, 2009, from http://www.musanim.com/miller1956/
Mullan, E. (2011). Physical modelling for sound synthesis. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Murch, W. (1998). Dense clarity – Clear density. Retrieved March 10, 2009, from http://www.ps1.org/cut/volume/murch.html
Murray, J. (1997). Hamlet on the holodeck: The future of narrative in cyberspace. Cambridge, MA: MIT Press.
Myst. (1993). Brøderbund.
Ong, W. (1982/1990). Orality and literacy: The technologizing of the word (L. Fyhr, G.D. Hansson & L. Perme Swedish Trans.). Göteborg, Sweden: Anthropos.
Pac-Man. (1980). Namco.
Pollack, I. (1952). The information of elementary auditory displays. The Journal of the Acoustical Society of America, 24, 745–749. doi:10.1121/1.1906969
Pollack, I. (1953). The information of elementary auditory displays II. The Journal of the Acoustical Society of America, 25, 765–769. doi:10.1121/1.1907173
Prince, R. (1996). Tricks and techniques for sound effect design. CGDC. Retrieved October 10, 2008, from http://www.gamasutra.com/features/sound_and_music/081997/sound_effect.htm
Pudovkin, V. I. (1985). Asynchronism as a principle of sound film. In Weis, E., & Belton, J. (Eds.), Film sound: Theory and practice (pp. ##-##). New York: Columbia University Press. (Original work published 1929)
Sjöström, V. (1921). The phantom chariot. Svensk Filmindustri.
Sobchack, V., & Sobchack, T. (1980). An introduction to film. Boston, MA: Little Brown.
Space Invaders. (1978). Taito.
Thom, R. (1999). Designing a movie for sound. Retrieved July 7, 2009, from http://filmsound.org/articles/designing_for_sound.htm
Valve Corporation. (1998). Half-Life [computer game]. Sierra Entertainment.
Wallén, J. (2008). Från smet till klarhet. Unpublished bachelor’s thesis. University of Skövde, Country. Retrieved month day, year, from http://his.diva-portal.org/smash/record.jsf?searchId=1&pid=diva2:2429
Warcraft III. (2002). Blizzard Entertainment.
White, G. (2008). Comment on the IEZA: A framework for game audio. Retrieved January 13, 2010, from http://www.gamasutra.com/view/feature/3509/ieza_a_framework_for_game_audio.php
Wilhelmsson, U. (2001). Enacting the point of being. Computer games, interaction and film theory. Unpublished doctoral dissertation. University of Copenhagen, Country.
ADDITIONAL READING
Adams, E., & Rollings, A. (2007). Game design and development. Saddle River, NJ: Pearson Prentice-Hall.
Alexander, B. (2005). Audio for games: Planning, process, and production. Berkeley: New Riders.
Alexander, L. (2008). Does survival horror really still exist? Retrieved March 12, 2009, from http://kotaku.com/5056008/does-survival-horror-really-still-exist
Branigan, E. (1992). Narrative comprehension and film. London, New York: Routledge.
Collins, K. (2007). An introduction to the participatory and non-linear aspects of video games audio. In Hawkins, S., & Richardson, J. (Eds.), Essays on sound and vision. Helsinki: Helsinki University Press.
Cousins, M. (1996). Designing sound for Apocalypse Now. In J. Boorman & W. Donohue, Projections 6: Film-makers on film-making (pp. 149-162). Location: Publisher.
Grimshaw, M., & Schott, G. (2007). Situating gaming as a sonic experience: The acoustic ecology of first person shooters. In Proceedings of DiGRA 2007: Situated Play.
Huizinga, J. (1955). Homo ludens: A study of the play element in culture. Boston: Beacon Press.
Jørgensen, K. (2006). On the functional aspects of computer game audio. In Proceedings of the Audio Mostly Conference.
Jørgensen, K. (2007). ‘What are these grunts and growls over there?’ Computer game audio and player action. Unpublished doctoral dissertation. Copenhagen University, Country.
Jørgensen, K. (2008). Audio and gameplay: An analysis of PvP battlegrounds in World of Warcraft. GameStudies, 8(2).
Juul, J. (2005). Half-real. Video games between real rules and fictional worlds. Cambridge, MA: MIT Press.
Katz, J. (1997). Walter Murch in conversation with Joy Katz. PARNASSUS Poetry in Review: The Movie Issue, 22, 124-153.
Klevjer, R. (2007). What is the avatar? Fiction and embodiment in avatar-based singleplayer computer games. Unpublished doctoral dissertation. University of Bergen, Country.
Murch, W. (2000). Stretching sound to help the mind see. Retrieved January 25, 2010, from http://www.filmsound.org/murch/stretching.htm
Neale, S. (2000). Genre and Hollywood. New York: Routledge.
Perron, B. (2004). Sign of a threat: The effects of warning systems in survival horror games. In Proceedings of COSIGN, 2004, 132–141.
Salen, K., & Zimmerman, E. (2004). Rules of play. Game design fundamentals. Cambridge, MA: MIT Press.
Stockburger, A. (2003). The game environment from an auditive perspective. In Proceedings of DiGRA 2003. Level Up.
Taylor, L. (2005). Toward a spatial practice in video games. Gameology. Retrieved June 23, 2007, from http://www.gameology.org/node/809
Whalen, Z. (2004). Play along: An approach to video game music. GameStudies, 4(1).
KEY TERMS AND DEFINITIONS
Affordance Theory: A theory put forth by James J. Gibson (1977 and 1986). An affordance is what an environment provides an animal. A path in a wood is “walk-able”, a chair is “sit-able” etc.
Ambient Listening: Increasing the frequency range of the sound by turning the body towards its source for higher definition of the sound.
Ambulatory Listening: Listening by moving around and using sound as part of the navigation within the environment.
Aperture Listening: Successive scanning of the audio stimuli.
Combined Model for the Structuring of Computer Game Audio: The model suggested in this chapter that combines the IEZA-framework with Murch’s conceptual model.
IEZA-Framework: A framework suggested by Sander Huiberts and Richard van Tol (2008). The IEZA-framework distinguishes between sounds that belong to the Interface (I), the Effects (E), the Zone (Z) and the Affects (A) in a computer game.
Murch’s Conceptual Model: A model for the production of film sound put forth by sound designer Walter Murch (1998). The conceptual framework spans from encoded sound (language) to embodied sound (music). It also suggests that, in order to obtain density and clarity in a sound mix, the sound designer should limit the number of sound layers to five.
Snapshot Listening: Fixating a point and then shifting to some other point momentarily by filtering out all other sound sources.
ENDNOTES
1 In action movies there has been, and still is, a tendency to equate loud with good (Thom, 1999). Violent explosions, big loud weapons and roars of wild engines fill the sonic environment in far too many action movies produced in the last two decades.
2 The interested reader is referred to a good textbook on acoustics and psychoacoustics such as Howard and Angus (1996).
3 The original article on the IEZA-framework has been criticized for not considering previous work (Ekman, 2008; White, 2008).
4 Which is of course the case also with the original IEZA-framework and Murch’s conceptual model but the level of detail is significantly higher in the combined model.
5 Of course further research into this would be necessary to put forth more solid cognitive ground for these primitives but we do believe that they fill their purpose for our combined model.
6 As Loftus & Loftus (1983) have shown, players may occasionally try to talk to the system in acts of personification of the system.
Chapter 7
An Acoustic Communication Framework for Game Sound: Fidelity, Verisimilitude, Ecology
Milena Droumeva
Simon Fraser University, Canada
ABSTRACT
This chapter explores how notions of fidelity and verisimilitude manifest historically both as global cultural conventions of media and technology and, more specifically, as design goals in the production of sound in games. By exploring these two perspectives on acoustic realism through the acoustic communication framework, with its focus on patterns of listening over time, acoustic communities, and ecology, I hope to offer a model for future theorizing and exploration of game sound and a lens for in-depth analysis of specific game titles. As a novel contribution, this chapter offers a set of listening modes that are derived from and describe attentional stances towards historically diverse game soundscapes, in the hope that we may use these not only to identify but also to evaluate the relationship between gaming and culture.
DOI: 10.4018/978-1-61692-828-5.ch007
INTRODUCTION
Within game studies, itself a relatively young discipline, the field of game sound has already experienced growth; however, resources and analytical frameworks for understanding the role of sound for purposes of cultural critique, historical analysis or cross-media examination are still scarce. Frameworks such as the IEZA one (Huiberts & van Tol, 2008; Wilhelmsson & Wallén, 2011), which builds on several existing design guideline systems for game sound (Ekman, 2005; Grimshaw & Schott, 2007; Jørgensen, 2006; Stockburger, 2007), and particularly Grimshaw’s (2008) conceptualization of an acoustic ecology in first-person shooter games are beginning to pave the way for more in-depth explorations into understanding, analyzing and representing the role of sound in games. In addition to the more established foundations of game sound in music synthesis, algorithmic sound generation, and real-time implementation of
Copyright © 2011, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
sound effects (Brandon, 2004; Collins, 2007; Friberg & Gärdenfors, 2004; Roeber, Deutschmann, & Masuch, 2006), there is a need for building more general theoretical and analytical frameworks to describe the various elements of game sound and their role within the game’s designed soundscape and its informational ecology. Examples of rich theoretical works on game sound are still few (Collins, 2008; Grimshaw, 2008). I would like to propose a framework for studying game sound that engenders a multi-disciplinary perspective with a specific focus on listening as a dynamically developing, socio-cultural activity influenced by and influencing cultural production and experience. This framework, based on the acoustic communication model developed by Barry Truax (2001) and inspired by R. Murray Schafer (1977) combines media histories with the current technological and cultural reality and takes a critical analytical stance towards discussing the way media shapes our world. Delivering a full history of any game sound predecessors and tracing critical, socio-cultural perspectives of every game genre in existence is not only an ambitious task, but is one that has been done in parts by both scholars and game writers (Collins, 2008; McDonald, 2008). Instead, I will focus on two particular aspects of game sound—fidelity and verisimilitude—and situate them within the interdisciplinary framework of analysis that the acoustic communication model offers. They are two sides of the same idea representing notions of realism or reality in game soundscapes. They reflect long-standing cultural ideals and production values whose histories transgress radio, cinema, and real-world environments. By juxtaposing the two ideas in this manner I hope to elucidate qualities and features of game sound both in a richer way and within a socio-historical discursive context. 
Fidelity reflects the development of sound in games from a technological perspective while verisimilitude reflects the cultural emergence of authenticity, immersion and suspension of disbelief in cinema,
and characterizes the magic flow state in games. Finally, I’d like to connect both these ideas to acoustic ecology and particularly to the concept of acoustic community, which includes the real situation of a player’s own acoustic soundscape in addition to the game’s sonic environment, interlaced in a complex ecology.
THE ACOUSTIC COMMUNICATION MODEL: BACKGROUND AND RELEVANCE TO GAME SOUND
The concept of acoustic communication articulated by Truax (2001) is a framework that attempts to bring multi-disciplinary perspectives into the study of sound reception as well as sound production and that provides a structure for analyzing and understanding the role of sound in contemporary culture, in media, and in technology. Its roots lie in the tradition of acoustic ecology that was the basis of Schafer’s work in the late 1960s and 1970s: work that is already referenced by several authors (Grimshaw, 2008; Hug, 2011). The following history helps contextualize and focus the particular perspective that acoustic communication has taken on. A pioneer in the field of acoustic ecology, Schafer first defined the notion of a soundscape to mean a holistic system of sound events constituting an acoustic environment and functioning in an ecologically balanced, sustainable way (Schafer, 1977). Born out of the threat of urban noise pollution, Schafer focused on conceptualizing and advocating an ecological balance in the acoustic realm. He developed the terms hi-fi and lo-fi to describe different states of aural stasis in the environment. A hi-fi soundscape, exemplified in Schafer’s view by the natural environment, is one where frequencies occupy their own spectral niches and are heard distinctly, thus creating a high signal-to-noise ratio. A lo-fi soundscape, on the other hand, often exemplified by modern urban city settings, is one where amplified sound,
An Acoustic Communication Framework for Game Sound
traffic, and white noise mask other sound signals and obstruct clear aural communication, creating a low signal-to-noise ratio (Truax, 2001, p. 23). Following Schafer’s work, Truax developed a multi-disciplinary framework for understanding sound based on notions of acoustic ecology as well as communication theory. This framework models sound, listener and environment in a holistic interconnected system, where the soundscape mediates a two-way relationship between listener and environment (Truax, 2001, p. 12). It also places importance on the role of context in the process of listening, emphasizing the listener’s ability to extract meaningful information from the content, qualities, and structure of the sound precisely by situating this process in their knowledge and familiarity with the context and environment (p. 12). Yet Truax also recognizes listening as a product of cultural and technological advances, subject to macro shifts and patterns over time. Such a multi-disciplinary understanding of sound allows us to bring socio-cultural considerations into the soundscape paradigm alongside auditory perception and cognition. Traditional models of auditory perception conceptualize listening as a process of neural transmission of incoming vibrations to the brain (Cook, 1999) that, shaped by our physiology, allows us to experience sound qualities. In fact, as pointed out by Truax (2001) and others, listening is a complex activity involving multi-level and dynamically shifting attention, as well as higher cognitive functions (inevitably dependent on context) such as memory associations, template matching, and foregrounding and backgrounding of sound (p. 11). Again, this model points to the importance of understanding listening as a physiological as well as a cultural and social practice. 
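The hi-fi and lo-fi distinction above is framed in terms of signal-to-noise ratio. As a rough illustration only, the following Python sketch computes a ratio in decibels for a foreground sound against two different backgrounds. Schafer's terms describe a perceptual and ecological balance, so reducing them to a single number is a simplification; the sine-tone stand-ins, the amplitudes, and the function names are assumptions made for this example and do not come from Schafer or Truax.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels, from RMS amplitudes."""
    return 20.0 * math.log10(rms(signal) / rms(noise))


def sine(freq_hz, amplitude, n=8000, rate=8000):
    """One second of a sine tone (a crude stand-in for a real recording)."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / rate)
            for i in range(n)]


# Hypothetical foreground signal and two hypothetical backgrounds.
bird_call = sine(2000, 0.5)     # the sound a listener is trying to hear
quiet_ambience = sine(100, 0.05)  # low-level background (hi-fi case)
loud_traffic = sine(100, 0.5)     # background as loud as the signal (lo-fi case)

hi_fi = snr_db(bird_call, quiet_ambience)  # roughly 20 dB
lo_fi = snr_db(bird_call, loud_traffic)    # roughly 0 dB

print(f"hi-fi soundscape SNR: {hi_fi:.1f} dB")
print(f"lo-fi soundscape SNR: {lo_fi:.1f} dB")
```

A fuller model would compare levels per frequency band, since masking depends on sounds sharing the same spectral region; the single broadband figure here only captures the overall balance.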
From a design perspective, it is also imperative to understand that listening is a dynamic and fluid activity that in turn affects the perception and experience of sounds in the acoustic or electroacoustic environment and helps mediate the relationship between actor, activity, context
and environment. Two major classifications of listening are everyday listening, as put forward by Gaver (1994, p. 426), which is an omni-directional, semi-distracted, adaptive-interactive listening that focuses on immediate information-processing of sound, and analytic listening (Truax, 2001, p. 163), which attends to detail and is an expert activity focused on an aesthetic or analytical experience of sound, rooted in context as its frame of reference for extracting information from sound characteristics. Based on the idea of different classifications of listening, Truax developed a number of categories exemplifying major listening modes and processes (pp. 21-27): see Table 1. Clearly, this ontology of listening needs a significant degree of modification in order to fit the complexities of listening in gameplay contexts, and we will keep returning to, adding to, and re-conceptualizing the idea of listening positions with regard to game soundscapes. This set of listening types is simply a beginning, giving us a way to access the historical evolution of listening stances as media, technology, and design have changed. These types of listening, as part of the acoustic communication framework, directly represent macro shifts in the historical and cultural reality of acoustic, electroacoustic, and media listening, and, by extension, game listening. In analyzing game sound, then, this set of listening attentions is to be amended in a similar fashion to uncover and elucidate macro shifts brought about by the socio-historical experience of sound in games. The notions of fidelity, verisimilitude, and ecology are a particular choice too, yet I see the drive towards realism not only as one aspect of game design and game culture but as a more symbolic movement intersecting many media genres and technologies. Rather than simply a design requirement, it is an ideology of contemporary mediated expressions. 
Examples span from immersive cinematic soundscapes for the big screen and surround sound aesthetics
Table 1. List of listening positions from the acoustic communication framework (Truax, 2001)
Listening Position: Description
Listening-in-search: Active, attentional and purposeful listening, a questing out towards a sound source or soundscape. Sometimes listening-in-search involves a determined seeking of a particular sound template in an aurally busy environment. The cocktail party effect, for example, is a special mode of listening-in-search, which involves a zooming in on a particular sound source, often semantic-based (speech) and familiar, in an environment of competing sound information in the same spectrum (Truax, 2001, p. 22).
Listening-in-readiness: Background listening with an underlying expectation for a particular sound or set of sound signals (such as a baby’s cry). It is a sub-attentional listening in expectation of a familiar sound or signal, a latent alertness.
Background Listening: A non-attentional listening, a receptive stance without conscious attention to or interpretation of the sounds or soundscape heard.
Media Listening: An adaptation to media’s flow of perceptual and attentional cues as delivered through sound. Media listening and distracted listening are two positions of listening that Truax (2001) argues are a direct result of the transition to electroacoustic sound and especially the way in which sound has evolved in its use in media. Since much of media is experienced as a background to life, often in the visual background, programming flow has developed sophisticated and strong aural cues in order to manage and direct listeners’ attention to the next item on the media program.
Analytical Listening: A focused, critical expert listening to particular qualities of electroacoustic sounds and recordings.
taking the viewer into a powerful suspension of disbelief, to complete virtual reality, ambient intelligent environments, and computer-augmented physical spaces, which have become the norm for contemporary museums and art galleries. There is also the ever-so-popular genre of reality TV, which has reared and acculturated a “society of the spectacle” generation of audiences.
FIDELITY
Literally, fidelity means faithfulness. In relation to sound, fidelity signifies the accuracy and quality of sound reproduction, that is, the degree to which an electroacoustic iteration faithfully represents the original acoustic source. From there, the notions of hi-fi (high fidelity) and lo-fi have emerged and are now commonly applied to refer to quality of audio equipment, specific recordings and (cinematic) listening experiences. As noted in the previous section, Schafer (1977) also utilized these two distinctions of fidelity, except he applied them to refer to a soundscape’s ecological balance in terms of a signal-to-noise ratio. In this section, we’ll focus on fidelity as a concept representing
the move from abstract musical chiptunes (8-bit synthetic tunes) to realistic sampled sounds in the design of game soundscapes. Fidelity here will exemplify the technological changes in game sound’s realism.
Role in Game Sound: Socio-Cultural History
In tracing some of the history of game sound, Stephen Deutsch (2003) makes a convincing point about the trajectories that sound for games has taken historically. As he points out, the first game sound designers were essentially musicians and/or experimental composers (p. 31). Historically, there was a split between those who followed Pierre Schaeffer’s musique concrète tradition and those who were interested in electronic music. The second group ended up getting involved in game sound production and laying the foundations of contemporary game sound. This concerns fidelity because, while musique concrète works with sampled sound (that is, real acoustic sources) as material for sonic expressions, electronic musicians were fascinated with the purely abstract world of the synthesizer and
the completely un-real soundscapes it produced. From here we have the tradition of chiptunes: 8-bit synth tunes encoded directly on the microchip of the game console. Initially, of course, space and memory were some of the pragmatic issues driving the minimalistic and synth-based soundscapes in games. With technological improvements, such constraints are no longer relevant; however, the demographic of game sound practitioners still exerts a formative role, not on what is possible but on what is realized in game sound today and on how associations between sounds and their meanings in a game become forged. As Deutsch puts it, even though game sound emulates film sound in its “filmic reality” of representation, it is often too literal: “sound effects as opposed to sound design” (p. 31); see Figure 1. Invoking what Schafer (1977) might call the listener as composer, many games today utilize adaptive-interactive audio, that is, each player
constructs her own unique soundscape by moving and interacting with her avatar. Yet even then sound effects are “loopy”: they often come from generic sound banks (see Figure 3) and are exactly the same each time they sound, sometimes getting cut off if the player’s actions are faster than the sound file’s duration. They get called up and filtered according to the spatial/contextual demands of the character’s progression; however, it is only in high-end games, typically first-person shooters (FPS), that the richness of a complex soundscape really comes through, with 3D audio rendering and spatialization (Grimshaw, 2008) to account for acoustic coloration and atmospheric variables. FPS games afford the player the unique position of literally listening with the character’s ears since the game presupposes the player is that character. Any other POV (point of view) character stance by definition distances the player from the soundscape, making
Figure 1. Note the compressed, repetitive nature of the waveform, reflecting synthetic strings of sounds, often separated by little sine tone clicks and artificial silences
Figure 2. Historical, cross-genre, and cross-platform examples of game soundscapes. In the first two examples we see a progression from 8-bit sound to polyphonic synthesized sound, while Fallout 3 reflects a 3D spatialized environment of varied and large dynamic range (highs and lows) to avoid masking and maximize clarity; finally, God of War features a broadband soundscape where many high-quality sounds are mixed in, competing with and somewhat masking each other
them more of an audience member as opposed to a true participant in that acoustic ecology. This current model of game sound design has slowly shifted to reflect the interactive, dynamic and personalized nature of game soundscapes, departing from the cinematic tradition and the early game 8-bit sound. It uses sound samples organized
in banks that are called up in real-time to be filtered and mixed in as a player progresses through a game, reflecting the quality of a space, sound behaviour and ambience in real time as well. For example, if our avatar is in an ocean setting we will hear waves, wind and seagulls; similarly, if the avatar is moving down a tunnel looking to
Figure 3. Note the flow of gameplay, composed of a series of loops that vary slightly but share a uniform attack-sustain pattern and thus still sound “loopy”; they are often triggered out of temporal sync, resulting in unrealistic interruptions and overlap. Also, the stereo zoom-in reveals little if any spatialization. Elements that aren’t identified on the diagram are the background music and cave ambiance, as well as a few other uniform sound effects such as footsteps
avoid or preempt enemy attacks, out-of-frame sounds are heard as coming from their respective (implied) locations and from the appropriate distance. There are three ways in which we can examine the shifts of game sound fidelity over time. As pointed out in other game sound histories (Collins, 2008; McDonald, 2008), 8-bit sound from early fantasy and arcade-style games has evolved to polyphonic MIDI orchestrations, higher quality
rendering, and richer textures but with essentially the same melodies and game sound conventions. On the other hand, shifts in interactive-adaptive audio, as a relatively contemporary design standard, are less evident historically, but manifest themselves across different game genres and platforms. For instance, portable platforms feature only a limited sonic variety in representative/environmental sound effects, relying heavily on synthesized polyphonic mixes; more affordable
consoles such as the Gamecube, the Wii, and the PlayStation 2 tend to feature games with more authentic soundscapes and variety, and higher-end consoles such as the PlayStation 3 and Xbox 360 flaunt stellar graphics as well as multi-channel, 3D sound capabilities that deliver the precision of spatialization and timbre characteristic of FPS games. Similarly, fantasy and action role playing games (RPGs) such as Final Fantasy, Prince of Persia, Assassin’s Creed 2, and God of War, to mention a few, use limited and uniform sound effects banks to build environments with minimal acoustic properties, even though the audio is less compressed in quality than in their predecessors. Higher-end military, FPS and strategy games such as Hitman and Metal Gear Solid often combine a rich variety of high-quality sound effects rendered with 3D sound spatialization techniques and sound behaviour physics engines to simulate the temporal and spatial trajectories of competing sonic information in the game space. Finally, fidelity changes in game sound can also be discussed in terms of Schafer’s classifications of hi-fi and lo-fi soundscapes (1977; Truax, 2001, p. 21) reflecting the ecological acoustic balance in a given environment. Quite simply, as game sound has become more complex, richer in textures, and in need of accommodating an ever-expanding variety of alert cues and signals, game soundscapes have become sites for much sonic masking. If we look at Figure 2 we see a transition from a one-track synthesized music model, which lacks authentic fidelity but has little masking, to more complex games where the soundtracks become a constant broadband spectrum of high-quality music, environmental sound effects, alerts and signals, and ambience coloration. 
However, the newest trend in game sound design (Collins, 2008; Farnell, 2011; Hug, 2011; Phillips, 2009) might be to return to synthesis, utilizing much more sophisticated tools (physical modeling and real-time sound synthesis) to realistically convey not only every sound occurring
in a game but its every unique variation, coloration, temporal and spatial character, in interaction with other sounds within the electroacoustic environment. Such an approach to game sound synthesis would make the game soundscape truly personalized through subtlety and non-repetition, and it would reverse the tendency to use substitute aural objects or sound images from the cinematic tradition, essentially returning game sound to a realistic modelling of acoustic phenomena. However, would such a turn eliminate the necessity for purposeful sound design? Would it make it all about programmatic representation? After all, sound’s role in games is not simply descriptive, one of reflecting reality in a high-fidelity manner, but it is largely about function! Interface sounds, warning sounds, alerts, and musical earcons must continue to be part of this acoustic ecology, subject to issues of acoustic balance, masking and fidelity, as well as the informational ecology of interactive play.
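The “loopy” quality of bank-based sound effects discussed in this section comes from replaying an identical file for every occurrence of an event. Full physical modeling is beyond a short example, but a much simpler and common mitigation can be sketched: choosing among several recorded variants and applying small random pitch and gain offsets on each trigger. The Python sketch below is hypothetical throughout; the class name, file names, and parameter ranges are assumptions for illustration, and it only selects playback parameters rather than rendering audio.

```python
import random


class SoundBank:
    """Hypothetical bank of interchangeable samples for one game event."""

    def __init__(self, samples, pitch_jitter=0.05, gain_jitter_db=2.0, rng=None):
        if not samples:
            raise ValueError("a bank needs at least one sample")
        self.samples = list(samples)
        self.pitch_jitter = pitch_jitter      # +/- fraction of playback rate
        self.gain_jitter_db = gain_jitter_db  # maximum attenuation in dB
        self.rng = rng or random.Random()
        self._last = None

    def next_variation(self):
        """Pick a sample, avoiding an immediate repeat, plus small offsets."""
        # Exclude the previous sample unless the bank holds only one.
        choices = [s for s in self.samples if s != self._last] or self.samples
        sample = self.rng.choice(choices)
        self._last = sample
        pitch = 1.0 + self.rng.uniform(-self.pitch_jitter, self.pitch_jitter)
        gain_db = self.rng.uniform(-self.gain_jitter_db, 0.0)
        return {"sample": sample, "pitch": pitch, "gain_db": gain_db}


# Hypothetical footstep bank; the file names are placeholders.
footsteps = SoundBank(
    ["step_a.wav", "step_b.wav", "step_c.wav"],
    rng=random.Random(1),  # seeded only to make the example repeatable
)
events = [footsteps.next_variation() for _ in range(6)]
for event in events:
    print(event)
```

This kind of variation reduces exact repetition but does not model the sound source itself, so it sits between the fixed banks described above and the real-time synthesis approach the cited authors propose.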
The Listening Experience
So what types of listening do these aspects of fidelity foster in game players/listeners? Listening is essentially a particular way of paying attention. Truax (2001) describes this phenomenon in terms of listening positions that we have developed both with regard to everyday listening and when engaging with different forms of media (pp. 19-23). Film theorists such as Chion (1994) and Murch (1995), among others, have already spoken about different listening modes: the one proposed by Chion has also been discussed and augmented by Grimshaw and Schott (2007) in their discussion of FPS games. Tuuri, Mustonen, and Pirhonen (2007) provide a more recent compelling account of listening modes in gameplay, identifying a hierarchical attentional structure of listening. Table 2 attempts to summarize popular notions of listening to game sound and organize them according to existing typologies of game functions (Jørgensen, 2006), attentional positions (Stockburger, 2007)
Table 2. An attempt at linking attentional and listening positions with game functions and examples of game sound
Attentional Position: Foreground
Game Functions: Action-oriented functions
Listening Position: Analytical Listening (Truax, 2001); Listening-in-search (Truax, 2001); Semantic Listening (Chion, 1994); Causal Listening (Chion, 1994); Functional, Semantic and Critical Modes of Listening (Tuuri, Mustonen, & Pirhonen, 2007)
Examples from Gameplay: Alerts (notifications, warnings, confirmation and rejection); interface sounds
Reference Frame: Trans-diegetic
Attentional Position: Mid Ground
Game Functions: Orienting functions; identifying functions
Listening Position: Media Listening (Truax, 2001); Navigational Listening (Grimshaw & Schott, 2007); Causal and Empathetic Modes of Listening (Tuuri, Mustonen, & Pirhonen, 2007)
Examples from Gameplay: Contextual sound effects; auditory icons; earcons
Reference Frame: Diegetic
Attentional Position: Background
Game Functions: Atmospheric functions; control-related functions
Listening Position: Background Listening (Truax, 2001); Reduced Listening (Chion, 1994); Reflexive and Connotative Modes of Listening (Tuuri, Mustonen, & Pirhonen, 2007)
Examples from Gameplay: Musical score; environmental soundscape
Reference Frame: Extra-diegetic
and states of diegesis (Chion, 1994; Grimshaw, 2008; Huiberts & van Tol, 2008; Jørgensen, 2006). As a side note, Jørgensen’s (2011) newest work in this book brings an important critique of the very usefulness of discussing game sound in terms of diegesis given that sound in games needs to function on many different levels besides a descriptive/immersive one and such levels may be non-diegetic according to film theory’s definition of diegesis, and yet function as diegetic cues within a game’s soundtrack. As another limitation of diegesis, I will argue in the last section of this chapter that it fails to recognize sounds outside the gameworld which may very much be part of the experience of play: the acoustic soundscape of group play, the arcade environment or online audio conferencing such as Teamspeak.
VERISIMILITUDE
If fidelity refers to the faithfulness of sound quality in computer games, verisimilitude concerns itself with the experience and nature of truthfulness and authenticity in a game context, as conveyed through the game soundscape. In the section above we used the notion of fidelity to trace the move from synthetic tones representing real actions to realistic sound effects attached to character movements that are called up interactively to combine into a unique and (at least in principle) seamless flow. Verisimilitude addresses precisely the nature of this acoustic ecology and its claim to represent a realistic experience in both temporal and spatial terms. In its traditional literary/theatrical definition, verisimilitude reflects the extent to which a work of fiction exhibits realism or authenticity, or otherwise conforms to our sense of reality. In film, the notion of verisimilitude signifies the relative success of cinematography at creating an immersive, engaging fictional world of hyper-realistic proportions both in terms of image and sound, but also of intensity of emotion and experience (Chion, 1994; Deutsch, 2003; Figgis, 2003; Murch, 1995). The core idea in this section is the notion that game sound has developed historically to conform to our sense of reality while at the same time it has constructed a sense of reality, particular to games, that we now expect.
Role in Game Sound: Socio-Cultural History
Cinematic immersion works by presenting a hyperreal universe, a larger-than-life movie world with action and emotion wrought to an exaggeratedly high intensity. It both summons attention and diverts attention. Its visual and auditory elements both attract and construct an experience and work to divert the audience’s attention from realizing that what they see isn’t real. In games, this is even more the case: by definition games are interactive, and their auditory and visual elements are driven by
the player. So already, there is an implication that the auditeur is also a participant, hearing with the ears of the character. As Chion (2003) puts it (in relation to David Lynch’s cinematographic style): “We listen to the characters listening to us listening to them” (p. 153). In FPS games, this relationship is even clearer, as the soundscape design is very intentionally oriented towards an authentic experience of listening with the character’s ears: the acoustic field shifts with the avatar’s movement on screen, and the reflections, sound coloration and directionality of sounds shift dynamically and responsively along with it. This is a mode of listening that Grimshaw (2008) defines as first-person audition (p. 83). Undoubtedly, one of the most important predecessors of game sound is sound in cinema. Expanding the context of significance to other media forms would include radio (the predecessor to film) as well as television and a particular genre of motion picture: cartoons (with their own predecessor, the paper comic). Unlike cinema, however, where sound’s role is highly artistic and affective, or radio and television, where sound is part of a programming flow (Truax, 2001, p. 169), sound in games must aspire to aesthetic and affective as well as informational and epistemic functions. Since games are an interactive medium, these functions often overlap and are interdependent. Verisimilitude as a feature of a designed or supporting soundscape can be traced back to the early days of radio, particularly with radio drama (Truax, 2001, p. 170). 
In the absence of a visual reference, in-house generated sound effects came to play a central role in creating a realistic environment to go along with the narrative, thus inadvertently giving birth to some of the most widespread conventions of cinema and game sound. Notable examples are fist-fight sounds and walking-in-snow sounds: the former are generated as an artificial exaggeration of what a punch would sound like, and the latter are easily simulated by grinding a fist into a bag of rice or peppercorns. Foley art, which emerged as the mainstream film
sound craft in the earlier days of modern cinema, and which is experiencing a resurgence today, builds directly on these conventions, generating an ever-increasing repertoire of techniques through which to simulate “real” sounds (typically by using other acoustic materials). In his discussion of film sound Christian Metz (1985) uses the term aural objects to refer to film’s tendency to solidify an arbitrary relationship between the viewer/listener’s perception of real sounds and the reality of the actual sound sources. The result, as pointed out not only by him but also by other film theorists such as Chion (1994), Deutsch (2003), Figgis (2003) and Murch (1995), to name a few, is that film sound bites become hyper-real: we associate them with certain events and interactions in place of their authentic acoustic counterparts. For example, if someone played back the actual sound of walking in snow and the sound of close-miked grinding into a bag of rice, most of us would perceive the latter as more real. Given such a set of conventions, and media’s natural condition of being an inter-textual and self-perpetuating phenomenon, subsequent media forms and genres simply have to play on and incorporate said conventions. Or do they?
Aural Objects, Flow and Space
As mentioned already, the first RPGs utilized a small corpus of synthesized melodies to denote unique spaces, quintessential game moments and mood. Loosely based on music psychology conventions, these early game soundscapes used major tonality to signify an uplifting mood, minor tonality to signify danger or failure (as in Zelda or the Final Fantasy series), an upward note-trill to denote a jump and a downward note sequence to indicate death or end-game (as in all of the Super Mario-based and derivative series). The bigger picture in the early days consisted of having a continuously running soundtrack of synthesized music where many smaller elements that are
meaningful in themselves mix together to create a flow of gameplay experience (McDonald, 2009) but also a game space. As with narrative support music in cinema, synthesized tunes in early games, specifically in the fantasy genre (titles such as Final Fantasy, Zelda, Castlevania and others), act as a vector (to use Chion’s (1994) term) to the temporal flow of the interactive experience and take on iconic or referential meaning (Deutsch, 2003). It is precisely this quality of game sound that illustrates perfectly the distinction between fidelity and verisimilitude: as technologies, storage capacities and processing speeds of game consoles have improved over time, some games have moved towards a more and more authentic depiction of the acoustic reality, while others continue to preserve the nostalgic qualities of what Murch (1995) calls metaphoric sound, only in better sound quality (see Figure 2). Metaphoric sound, one that does not realistically represent the action seen on the screen, is so ingrained in our cultural memory that it seems odd to even point it out. Popularized by early fantasy games and their predecessors, isomorphic cartoon sounds (Altman, 1992), it contributes to a type of verisimilitude that is very different from the one that richer and more realistic game genres strive for (adventure, military or FPS games). In other words, Super Mario, Zelda or Final Fantasy just wouldn’t be recognizable to their audience or, in our terms, possess verisimilitude, if it were not for their inter-textual references to iconic sounds of the past. Examples are ample: the theme sounds of their game universe or even individual sound effects such as the 1-up sound, the brick-smashing sound or the jumping tune in Super Mario; the battle cries of Zelda’s Link and its iconic chest-opening sounds; or the epic combat rhythms during attacks and boss battles in Final Fantasy, among many others. 
Given this, sound designers for classic fantasy titles take great care to preserve these iconic sounds on each platform and in each iteration of their titles. As Phillips (2009) mentions in his exposé on film and game music,
fantasy game theme songs have long transcended the computer game genre and, particularly in Japan, are frequently re-orchestrated and performed by choirs and symphonic orchestras. Composers of game music, while largely unknown in North America, have star status in most of Asia. There is another issue too: fantasy games deal with imaginary actions that no one has experienced in the real world, such as stepping on enemies’ heads, eating a giant mushroom, or catching a star (references from Super Mario), and, sonically, these actions do not have ‘real’ counterparts in the acoustic reality we are familiar with. Creating the infamous sound of the lightsaber in Star Wars (McDonald, 2008) is a classic story in the history of metaphoric sound using both musical conventions and pop-psychology. Likewise, this quote from a sound designer of Torment illustrates game verisimilitude challenges perfectly:
During Torment, I was processing some sword hits, and they were coming up very interesting. While they didn’t work for the spell I was working on, I gave them a description like ‘reverberant metal tones, good spell source.’ Later, I was looking for something with those qualities, but had forgotten I made those sounds. When I searched my database for ‘metal tones’, I found them, and they were exactly what I needed! (Farmer, 2009)
A less discussed but highly important part of game verisimilitude is the temporal flow of the soundscape, as it is intimately linked to the tradition of sound effects and aural objects. While the fantasy sound of the past presents a highly melodic, musically semantic flow, the interactive-adaptive tradition results in a “loopy”-sounding score of slightly varied bank sound effects (for example, there may be only one footsteps sound that is nevertheless used for all characters) organized around modules of game quests and activities but lacking an overall structure or temporal design (see Figure 3). 
Another aspect of verisimilitude in game sound has to do with creating space, specifically
in realistic, rich cinematic RPG/action games. I will begin with Murch’s (1995) notion of worldizing (giving a certain space acoustic qualities that make the player get involved) and combine that with Ekman’s (2005) discussion on diegetic versus non-diegetic sound as acoustic elements that do or do not belong to a gameworld. Historically, it is important to note again how early games (Collins, 2008; McDonald, 2008) instantiated the use of a melody to represent space. For example, in Final Fantasy, towns have a certain melody representing the calm mood of a non-threatening environment, while out-of-town wooded areas use a separate melody which is consistent everywhere in the game and represents mild danger; mission dungeons have their own musical melody and, within them, entering the space of a boss battle features a fast-paced tension music that is consistently the same throughout the game for each boss battle. Thus, these games established a situation where mood, space, and call-to-action are rolled into one and are all represented via one single melody/track. With the emergence of more powerful game consoles, the notion of space becomes divorced from the conveyance of mood or a call for a particular action and becomes more representative and realistic, aiming to immerse the player into a gameworld. This connects the idea of diegesis with the notion of verisimilitude through the experience of immersion, as “immersion is a mental construct resulting from perception rather than sensation” (Grimshaw & Schott, 2007, p. 476). While the cinematic concept of diegesis simply refers to whether or not the sound source is in or outside the frame, both Jørgensen (2006) and Ekman (2005) use this term to address whether a sound belongs to a gameworld or not. 
There is an important distinction to be made in using diegesis in this way, as it puts the emphasis on immersion into the resounding space (Grimshaw & Schott, 2007) of a game and carries an implication that the gameworld already is an acoustic reality to which sounds either belong or do not belong. On the other hand, regarding diegesis only as a reference to in- or out-of-frame sounds leaves the game soundscape intact, as it then assumes that all sounds are part of the gameworld. Such an idea fits perfectly with Schafer and Truax’s notion of an acoustic community (1977; 2001): a sonic locale or context that is formed over time through a dynamic exchange between sounds, soundscape and listeners, becoming an ecology of its own that can be threatened, altered or generally disturbed by the introduction of new, foreign sounds or the removal of familiar signals that local inhabitants (players) depend upon. The question is whether it is an ecology where the listener is consumed by the soundscape in a spectator-based relationship (Westerkamp, 1990), or whether the ecology includes the player in an (inter)active co-production. Again, we have to remind ourselves that immersion is a perception, not a sensation (Grimshaw, 2008, pp. 170-174). The answer is in the ear of the listener, so to speak: while even realistic games represent only a small portion of the game environment sonically (see Figure 4), they do successfully create and maintain a sense of immersion, verisimilitude, and belonging to a gameworld, not to mention conveying information through sonic signals.
LISTENING TO GAME SOUND
It follows that the historic shifts of verisimilitude in game sound have affected the experience of listening as well. With the socio-cultural baggage of radio and film sound, listeners are already conditioned to accept aural objects (Metz, 1985), internalize them, and think of them as more real than the real sounds they represent. Further, listeners of game sound have adopted, in the aural sense, what Colin Ware (2004) refers to in visual studies as naive physics of perception. That is, players accept and often ignore the clearly artificial behaviour of looped sound bites, their sometimes low or unrealistic quality, and their lack of diversity and complexity (see Figures 3 and 4). Ware’s point is that designers often reduce work and design complexity by counting on the fact that players do not need that much realism, only enough to be hooked. The idea is that it is acceptable if many things from the real world do not manifest themselves sonically in the gameworld. Given this, we can now expand the framework of listening positions from Table 1 to include a pattern of attention to sound that ignores the otherwise obvious “loopy”-ness of sound effects and, as such, the predictability of game soundscapes as a whole. A listening of denial, or naive listening, is perhaps a good term to use. It is not that players cannot, when prompted, identify the artificial nature of many sonic elements in a game soundscape; it is that they conditionally and purposefully ignore it, immersing themselves instead in the experience of gameplay. Ideals of game sound become less about fidelity of acoustic sources or of audio quality and more about the verisimilitude of non-engaging engagement with a holistic, interactive environment.
From the discussion so far, there are a few other modes of listening that I would like to put forth. Before I introduce them, however, it is important to draw a link between the types of listening fostered by the flow of television and contemporary radio soundscapes and those encouraged by the gameplay experience in general. The emergence of continuous media such as radio and TV created a brand new type of listening experience, one that Truax calls distracted or media listening (2001, p. 169). In order to accommodate viewers tuning in and out of the program and at the same time attract and keep their attention, TV sound flow uses a number of attention-management techniques such as dynamic shift changes and modular programming structure (Truax, 2001, p. 170). It essentially tells us how to listen. It trains us to increase or decrease our auditory attention by use of carefully crafted cues, until they become second nature.
Figure 4. A sonic excerpt from Grand Theft Auto: San Andreas gameplay. While richer and more varied in dynamic range (including periods of relative silence), the game flow still consists of a series of sound effects strung together, with some distance/amplitude rendering.
These gestalts of auditory perception, then, seamlessly integrate cinema and game sound, carrying the promise of total immersion, suspension of disbelief and verisimilitude. As a result, we begin relying less on active, engaged, information-processing listening, and more on habitual background and media listening in all of our surroundings (Schafer, 1977; Truax, 2001). This is not to forget, however, that games are interactive, and that the player is, in Schafer’s terms, a co-composer of her own game soundscape at the same time that she listens to it. The listening positions that I would like to add, in the interest of engaging with and critically understanding the experience of computer game sound, are presented in Table 3.
Table 3. A set of listening modes emergent out of the current discussion on fidelity, verisimilitude and the ecology of game sound. These modes reflect and attempt to identify macro trends borne out of historical shifts in the qualities, techniques and functions of game sound over time.
Imaginative Listening
A listening that supplies the perceptual conditions for immersion, building up a mental image of an environment from the little that is provided acoustically by the game’s soundtrack; for example, the way a game like Cooking Mama is reminiscent of Super Mario games and evokes a fun, fantastical, care-free world.
Nostalgic Listening
An analytical, culturally-critical type of listening that has emerged over time in experienced players who look for iconic game music themes through platforms and generations of a particular game (some notable examples here being the Final Fantasy, Super Mario, Zelda and Mega Man series).
Disjunctive Listening
A listening position that describes the ability that gamers develop to very quickly and fluidly interchange listening attentions: one moment they may be immersed in the heat and tension of a battle and in the next they may pause to change their settings, entering a user interface type of soundscape (for instance, in the Fallout 3 example in Figure 2, the player shifts constantly between the battlefield ambience/listening position and armour selection/target selection screens).
Naive Listening
A non-analytical, electroacoustic listening that allows the player to feel immersed in the game reality with the minimum amount of auditory complexity. In the absence of truly realistic soundscapes, players effortlessly ignore loops, repetitions and lack of sonic fidelity in order to become more immersed in the game. The name is inspired by Ware’s (2004) naive physics of perception idea.
Conditioned Listening
The type of listening that Truax (2001) calls media [flow] listening (p. 169), where players listen with an underlying expectation of how the flow of the game’s soundscape will unfold, tacitly familiar with the sonic elements of games in general.
Inter-textual Listening
A result of cross-pollination of different media genres, this listening position addresses situations where game soundscapes contain radio, telephone, or TV sounds (most famously featured in Grand Theft Auto). Conversely, the popular events of Video Game Concerts are settings where game sounds live on outside gameworlds and are performed, listened to, and used for other purposes outside of games.
ECOLOGY
Discussing game soundscapes as sites of local acoustic ecologies is not a novel idea (Grimshaw, 2008) and, as Grimshaw and Schott (2007) point out, “the more immersive a game is the more appropriate it is to discuss the game world in terms of an ecology and, therefore, the greater the immersive function of the game sounds” (p. 479). Grimshaw, unlike Schafer, analogizes game soundscapes to an actual bio-ecology where various species (in our case, sounds) interact, co-exist and are co-dependent on each other. He also focuses on the ecology of first person shooter (FPS) game soundscapes, as this genre lends itself particularly well to a discussion of ecology in terms of sound. Spatialization and 3D sound rendering are honed to an art form in FPS games and the player literally has to listen through the character’s ears in order to play and succeed in the game. Sounds of shots, enemies in the background or out-of-the-frame (extra-diegetic) sounds are extremely important, as are user interface sounds including warnings and alarms that often require immediate attention and split-second decisions (trans-diegetic sounds, per Jørgensen, 2006). Schafer, however, would still look at ecology from the perspective of balance within an acoustic community where each sound has a meaning in the sonic context and a place within the spectral niche of the soundscape. This acoustic balance may or may not be in stasis: at certain times an element may mask and overpower other sonic elements. For example, in action scenes music often takes on a dominant sonic role, overshadowing smaller environmental or game alert sounds (in Figure 4 it is clearly visible in the full sequence layout (top section) that music tracks have a significantly higher/broader dynamic range than all other sound effects). For Schafer, and especially for Truax, whose work focuses more on electroacoustic sound, sound balance is not simply about loudness but also about value connotations. Music, for example, is not only a much stronger emotional, affective device than environmental sound within a given game environment, but it also carries a history of being used commercially to condition consumers into spending time and money in certain settings
(Truax, 2001; Westerkamp, 1991). As Hildegard Westerkamp (1991) points out, the phenomenon of background music is responsible for sound becoming “associated in our memories with environments and products”. In essence, it becomes the ambience of the media environment; however, it does not result in an endless diversity of spaces and sounds but, rather, in the emergence of archetypal surrogate environments (Westerkamp, 1991). In the context of games, ominous abstract tones analogous to the cinematic model of the mood track provide such a strong emotional sense, enforced and enriched by previous generations of media listening such as film, radio and TV, that the acoustic qualities of space, reverberation, distance, location and timbre, which are the more subtle yet vital cues of everyday listening, are often lost in the ‘background’. Similarly, music in action and rhythm games often provides a promotion vehicle for indie bands whose sound is conceived as culturally related to the genre of the game itself, thus perpetuating, rather than challenging, the status quo of popular culture and mass media. Essentially, music’s overshadowing of other sonic elements has both a cultural and a political-economic implication for games in addition to an acoustic one.
Ecology of Listening
While so far we have been discussing new listening patterns that emerge from the experience of game soundscapes and their socio-cultural and historical evolution, what about the listening that takes place inside game soundscapes? Does anybody listen within the game itself, or is it a silent vacuum where sound happens but no one can hear it? In other words, how would a game’s acoustic ecology change if characters in it (maybe even all of them!) could listen to one another and to the player’s character, or even to sounds outside the gameworld? In Truax’s (2001) terms this would complete the holistic relationship of true acoustic communication, uniting a constant interplay between listeners, sounds and soundscape, where game characters and the player-driven avatar are all participants in the ecology. However, such algorithmic subtlety is far from reality to date and, partly for economic reasons but also partly due to notions of value, may never be a generally utilized phenomenon anyway. Even though sound in games has experienced tremendous growth and is now considered an important part of game design, development companies still invest considerably less in it than they do in visual graphics and animation. Sound designers in game development companies are typically pressured to stick with tried and true approaches to composition, design and functionality of audio, and are dissuaded from implementing “risky” new ways of using sound as part of the game mechanics. There are, of course, a few examples where sound is used in more participatory or ecological ways. For a while now, the Nintendo DS has featured a microphone input, so games such as Electroplankton and, to a lesser degree, titles such as Yoshi’s Island or Guitar Hero incorporate user-generated vocal elements into the gameplay, mostly in the form of shouting, blowing or speaking into the microphone. More complex platforms support a genre called stealth games, where the avatar’s own soundmaking in the game (primarily footsteps) is implied to be heard by the non-player characters. Metal Gear Solid is the best known title, in addition to Hitman, Assassin’s Creed 2, and even youth-themed games such as Harry Potter and the Chamber of Secrets, or Zelda: Phantom Hourglass, where Link has to walk slowly in the Temple of Time in order not to alert the phantom knights. Even in a rudimentary implementation, such as linking the player/character’s speed to levels of “noise” in a given space, this approach taps into an aspect of acoustic ecology that has been largely overlooked: the character’s experience of listening within the gameworld.
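The rudimentary mechanic described above, tying a character's movement speed to a level of "noise" that non-player characters may hear, can be sketched in a few lines. The following Python is only an illustrative assumption and not how any of the named titles implement it: the function names, the linear speed-to-noise mapping, the inverse-square distance falloff and the threshold value are all hypothetical.

```python
def footstep_noise(speed, surface_gain=1.0):
    """Hypothetical mapping: emitted noise grows linearly with movement speed."""
    return max(0.0, speed) * surface_gain


def heard_level(noise, distance, min_distance=1.0):
    """Attenuate noise with an assumed inverse-square falloff.

    The distance is clamped to min_distance so that a listener standing
    on top of the source does not cause a division by zero.
    """
    d = max(distance, min_distance)
    return noise / (d * d)


def npc_hears(speed, distance, threshold=0.05, surface_gain=1.0):
    """Return True if the attenuated footstep noise reaches the NPC's threshold."""
    level = heard_level(footstep_noise(speed, surface_gain), distance)
    return level >= threshold


# Assumed example values: walking speed 1.0, running speed 4.0, guard 6 units away.
walking_detected = npc_hears(speed=1.0, distance=6.0)
running_detected = npc_hears(speed=4.0, distance=6.0)
```

With these assumed values the walking character stays below the guard's threshold while the running character exceeds it, which is the behaviour the stealth examples imply. A fuller model of listening within the gameworld would also need to account for masking by other sounds, which this sketch ignores.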
Acoustic Community as a Feature of Game Sound
We have already discussed acoustic community in the context of game soundscapes as a conglomeration of different types of sound cues, sound functions, foreground, midground and background sounds; a community that forms over time and evokes a coherent sense of place in the gameworld. In this section, I would also like to bring up the idea of the acoustic soundscape that is located outside the gameworld but exists synchronously with it: the sounds that surround the player in her physical environment, sounds that may or may not be related to the gameplay, but are nevertheless part of the immediate acoustic community that the player or players are in. Without focusing too much on the minutiae of less significant sonic details such as household sounds, context does offer quite a distinct sense of acoustic community depending on whether a player is at home alone, with friends, at an arcade, at a LAN party, or on a headset with online co-players (see Figure 5). A Rock Band house party, for example, is a particular community where the soundmaking of multiple players and audience members supplies much of what makes this game’s soundscape a great experience. It is precisely the exclamations of joy, frustration and encouragement, and not the designed game sound, that give this acoustic community a sense of both fidelity and verisimilitude. In contrast, many RPG, sports or puzzle games that are played at home, even with company, result in a much quieter soundscape with sporadic and minimal interaction. Using Teamspeak or other voice chat programs for massively multiplayer online role-playing games (MMORPGs) or multi-player military strategy games results in yet another acoustic community, where players’ voices have to fit seamlessly within the spectral niche of the game’s soundscape without masking or obliteration: every second counts and a lot of the designed sonic information is crucial to the gameplay (see Figure 5). Game expos, conventions and professional game championships are another quintessential acoustic community of gaming, filled with PAs (amplified public announcements) and a constant arcade-like hum of game sounds, along with the shifting of chairs and mashing of buttons (whether players are wearing headphones or not) and the murmur and exclamations of crowds. In fact the arcade environment, as Phillips (2009) points out, is responsible for some of the early choices in game sound, as each game’s signature soundtrack was designed to attract attention in a loud and noisy acoustic environment of competing game stations: hence gaming’s early and ready acceptance of sonic masking. As games moved into the home and became more technologically sophisticated, game sound changed to provide a fuller, more subtle soundscape, often to be delivered through headphones. With the emergence of MMORPGs, the popularity of game tournaments, expos, LAN parties and, most recently, Guitar Hero and Rock Band house party nights, gaming is once more returning to a social model of play where the sounds of the cultural context and setting are again significant and instrumental in forming that sense of acoustic community that unites designed game sound with the incidental (acoustic and electroacoustic) sound-making and sonic environment.
Figure 5. On the left we see a recording from an arcade ambience: a constant hum of competing, masking sounds, many of which are already distorted synthetic chiptunes (in the zoom-in section). On the right we have a Teamspeak-based recording of a World of Warcraft mission: the progression (upper section) clearly reflects more verbal excitement as the team finally defeats a difficult boss, culminating in celebratory exclamations.
CONCLUSION AND FUTURE DIRECTIONS
This chapter explores the notions of fidelity and verisimilitude manifesting historically both as global cultural conventions of media and technology and, more specifically, as design goals in the production of sound in games. By exploring these two perspectives of acoustic realism through the lens of the acoustic communication framework, with its focus on patterns of listening over time, acoustic communities and ecology, I hope to offer a model for future theorizing and exploration of game sound and a lens for in-depth analysis of particular game titles. It is also my hope that placing some much needed emphasis on listening, ecology, and the holistic acoustic setting of the gaming experience will benefit not only sound designers and game theorists but will also continue the trajectory of deepening inquiries into game studies as a rich and unique form of interactive media deserving of its own theoretical attentions. For example, before we go ahead and favour real-time audio synthesis and physical modeling for their realistic acoustic rendering (not an imminent event, I realize: science and programming still have a ways to go), we need to generate precisely the type of historical and socio-cultural analysis of game sound touched on in this chapter. We need to understand the importance of all the elements of a game soundscape which, for better or for worse, have become important to audiences or to which, at the very least, we are now habituated. There is a crucial epistemological relationship there: through inter-textual cross-pollination and transference of practices and artefacts, we have internalized many of these arbitrary meanings, and a realistic physical modelling of a game soundscape might not mean much to us or even be conducive to gameplay. Designers, audio engineers and programmers need to know and think about these issues. Further, I believe the focus on listening positions in this chapter is a key to understanding not only some of the cultural practices surrounding gameplay; it can also tell us something about auditory perception that designers or scientists could potentially use. Listening to game sound is now every bit as everyday as everyday listening goes in our media and technology-saturated environment, so games offer new opportunities to science, given the fact that contextual listening has always evaded laboratory psychoacoustic studies. Clearly, however, my main concerns are with the opportunities for critical and media studies to engage with and treat game sound and the phenomenon of listening to game sound as another rich cultural artifact (a text, if you will) that can add to the layers of theory and critique surrounding media, art, and cultural expression.
While fidelity and verisimilitude are only two relevant heuristics in the analysis of game sound, it is my hope that the field of media studies will identify others and conduct the same kind of rigorous examination of their historical and cultural roots in order to elucidate their role and importance not only in game sound but in our culture-at-large today.
Finally, my sense of the future directions in the field of game sound is that, as the game industry matures and as playing computer games starts to lose some of its negative notoriety, there is naturally more and more societal and media attention on games as well as on game elements such as sound. With that, the increased popularity of gaming results in industry growth, expanding game genres, and expanding notions of what a game is, how it is played and how it is experienced. Sound plays a crucial role in experience and interactivity and there has been increased design attention to it both from industry and from independent artists. With that comes a book like this one, and my prediction is that there will be (hopefully) more to come from scholars, critics, media theorists, sociologists, scientists, and designers who would now be better equipped to continue this in-depth conversation about game sound and listening in a way that preserves the complex ecology of people’s interactions with their (media/techno)-soundscapes while expanding the multidisciplinary nature of this maturing field. There has been a resurgence of concern over noise and the urban soundscape coming back into public attention in the context of environmentalism and sustainability, and it only takes one look at the history of game sound, inter-related with similar media forms and genres, to glean its influence on the way in which we listen to, make sense of and experience our physical offline soundscape. More work in this area is not only needed, but is, I am confident, bound to come.
REFERENCES
Altman, R. (1992). Sound theory sound practice. London: Routledge.
Assassin’s Creed 2. (2009). Ubisoft Montreal. Ubisoft.
Brandon, A. (2004). Audio for games: Planning, process, and production. Berkeley, CA: New Riders Games.
Castlevania. (1989). Konami Digital Entertainment.
Chion, M. (1994). Audio-vision, sound on screen. New York: Columbia University Press.
Chion, M. (1999). The voice in cinema. New York: Columbia University Press.
Chion, M. (2003). The silence of the loudspeaker or why with Dolby sound it is the film that listens to us. In Sider, L., Freeman, D., & Sider, J. (Eds.), Soundscape: The School of Sound lectures 1998-2001 (pp. 150–154). London: Wallflower Press.
Collins, K. (2007). An introduction to the participatory and non-linear aspects of video game audio. In Hawkins, S., & Richardson, J. (Eds.), Essays on sound and vision (pp. ##-##). Helsinki: Helsinki University Press.
Collins, K. (2008). Game audio: An introduction to the history, theory, and practice of video game music and sound design. Cambridge, MA: MIT Press.
Cook, P. (Ed.). (1999). Music, cognition, and computerized sound: An introduction to psychoacoustics. Cambridge, MA: MIT Press.
Cooking Mama. (2007). OfficeCreate. Majesco Publishing.
Deutch, S. (2003). Music for interactive moving pictures. In Sider, L., Freeman, D., & Sider, J. (Eds.), Soundscape: The School of Sound lectures 1998-2001 (pp. 28–34). London: Wallflower Press.
Ekman, I. (2005). Understanding sound effects in computer games. In Proceedings of the 6th Annual Digital Arts and Cultures Conference, 2005, Copenhagen, Denmark: IT University Press.
Electroplankton. (2006). Nintendo America. Nintendo.
Fallout 3. (2008). Bethesda Softworks. Bethesda Game Studios.
Farmer, D. (2009). The making of Torment audio. Retrieved July 9, 2009, from http://www.filmsound.org/game-audio/audio.html.
Farnell, A. (2011). Behaviour, structure and causality in procedural audio. In Grimshaw, M. (Ed.), Game Sound Technology and Player Interaction: Concepts and Developments. Hershey, PA: IGI Global.
Figgis, M. (2003). Silence: The absence of sound. In Sider, L., Freeman, D., & Sider, J. (Eds.), Soundscape: The School of Sound lectures 1998-2001 (pp. 1–14). London: Wallflower Press.
Final Fantasy 2. (1988). Squaresoft. Square ENIX.
Friberg, J., & Gärdenfors, D. (2004). Audio games: New perspectives on game audio. In Proceedings of ACM SIGCHI International Conference on Advances in Computer Entertainment Technology (pp. 148-154).
Gaver, W. (1994). Using and creating auditory icons. In G. Kramer (Ed.), Auditory Display: Sonification, Audification, and Auditory Interfaces (Santa Fe Institute Studies in the Sciences of Complexity, Vol. 18, pp. 417-446). Reading, MA: Addison-Wesley.
God of War 2. (2007). SCE Studios Santa Monica. Sony Computer Entertainment.
Grand Theft Auto: San Andreas. (2004). Rockstar North. Rockstar Games.
Grimshaw, M. (2008). The acoustic ecology of the first-person shooter: The player, sound and immersion in the first-person shooter computer game. Saarbrücken, Country: VDM.
Grimshaw, M., & Schott, G. (2007). Situating gaming as a sonic experience: The acoustic ecology of first-person shooters. In Proceedings of the Third Digital Games Research Association Conference (pp. 474-481).
Guitar Hero. (2006). Harmonix. RedOctane.
Harry Potter and the Chamber of Secrets. (2002). Eurocom. Electronic Arts.
Hitman. (2002). Io Interactive. Eidos Interactive.
Hug, D. (2011). New wine in new skins: Sketching the future of game sound design. In Grimshaw, M. (Ed.), Game Sound Technology and Player Interaction: Concepts and Developments. Hershey, PA: IGI Global.
Huiberts, S., & van Tol, R. (2008). IEZA: A framework for game audio. Gamasutra. Retrieved April 4, 2009, from http://www.gamasutra.com/view/feature/3509/ieza_a_framework_for_game_audio.php?page=3.
Jørgensen, K. (2006). On the functional aspects of computer game audio. In Proceedings of the First International AudioMostly Conference (pp. 48-52).
Jørgensen, K. (2011). Time for new terminology? Diegetic and non-diegetic sounds in computer games revisited. In Grimshaw, M. (Ed.), Game Sound Technology and Player Interaction: Concepts and Developments. Hershey, PA: IGI Global.
Marks, A. (2009). The complete guide to game audio: For composers, musicians, sound designers, game developers (2nd ed.). Location: Elsevier Press.
McDonald, G. (2008). A brief timeline of video game music. Retrieved July 8, 2009, from http://www.gamespot.com/gamespot/features/video/vg_music/.
Mega Man. (1993). Capcom. Capcom Entertainment.
Metal Gear Solid. (1998). Konami Japan. Konami Computer Entertainment.
Metz, C. (1985). Aural objects. In Belton, E. W. J. (Ed.), Film sound (pp. ##-##). New York: Columbia University Press.
Murch, W. (1995). Sound design: The dancing shadow. In Boorman, J., Luddy, T., Thomson, D., & Donohue, W. (Eds.), Projections 4: Filmmakers on film-making (pp. 237–251). London: Faber and Faber.
Phillips, N. (2009). From films to games, from analog to digital: Two revolutions in multi-media! Retrieved July 8, 2009, from http://www.filmsound.org/game-audio/film_game_parallels.htm.
Planescape: Torment. (2005). Black Island Studios. Interplay.
Rock Band. (2008). Harmonix. MTV Games.
Roeber, N., Deutschmann, E. C., & Masuch, M. (2006). Authoring of 3D virtual auditory environments. In Proceedings of the First International AudioMostly Conference (pp. 15-21).
Schafer, R. M. (1977). The tuning of the world. Toronto: McClelland and Stewart.
Spyro the Dragon. (2008). Insomniac Games. Sony Computer Entertainment.
Stockburger, A. (2007). Listen to the iceberg: On the impact of sound in digital games. In von Borries, F., Walz, S. P., & Böttger, M. (Eds.), Space time play: Computer games, architecture and urbanism: The next level (pp. ##-##). Location: Birkhäuser Publishing.
Super Mario Bros. NES (1985). Nintendo. Nintendo.
Truax, B. (2001). Acoustic communication (2nd ed.). Location: Ablex Publishing.
Tuuri, K., Mustonen, M., & Pirhonen, A. (2007). Same sound – different meanings: A novel scheme for modes of listening. In Proceedings of the Second International AudioMostly Conference, 2007, 13-18.
Ware, C. (2004). Information visualization: Perception for design (2nd ed.). Location: Morgan Kaufman Publishing.
Westerkamp, H. (1990). Listening and soundmaking: A study of music-as-environment. In Lander, D., & Lexier, M. (Eds.), Sound by artists (pp. ##-##). Location: Art Metropole & Walter Phillips Gallery.
Wilhelmsson, U., & Wallén, J. (2011). A combined model for the structuring of computer game audio. In Grimshaw, M. (Ed.), Game Sound Technology and Player Interaction: Concepts and Developments. Hershey, PA: IGI Global.
World of Warcraft. (2004). Blizzard Entertainment. Blizzard.
Yoshi’s Island. (2007). Nintendo Japan. Nintendo.
Zelda: Phantom Hourglass. (2007). Nintendo. Nintendo.
KEY TERMS AND DEFINITIONS
Acoustic Community: A term emerging from Schafer and Truax’s work with the WSP (World Soundscape Project) in the 1970s, referring to stable sonic locales that include a set of sounds which clearly belong there and characterize a community. For example, the sounds of coin machines, yelling, and synthesized music all belong to an arcade acoustic community.
Acoustic Ecology: A movement started by R.M. Schafer and continued through the World Forum for Acoustic Ecology (WFAE), and also a term denoting the sonic balance in a given soundscape through its signal-to-noise ratio.
Chiptunes: A term that has now been popularized by the arts community, referring to 8-bit synthesized melodies or single tones that were originally directly encoded onto a game’s electronic chip memory in early game development.
Diegesis: A term from film studies referring to what is in-the-frame of the screen as opposed to what is not. In game sound studies, Chion is credited with popularizing it to refer to sounds that are in or out of frame from the player’s perspective. It has also been used by others to refer to sounds that do or do not belong to the gameworld.
Fidelity: Literally meaning faithfulness, it refers here to the audio quality of a sound reproduction relative to its original acoustic source.
Listening Positions: Developed as a term by B. Truax, it refers to types of listening attention that have become patterns over time with exposure to certain types of sound environments, habits, or media; for example, background listening is a passive form of listening attention that we all engage in at different times.
Loopy: An adapted term I am using here to denote the quality of game sound flow in many RPG games where short looped sounds from an effects bank are triggered each time an action is performed, thereby often sounding cut-off, too similar, or simply uniform.
Soundscape: A term coined by R.M. Schafer to describe the totality of sounds surrounding us at any given time/place: analogous to a landscape.
Verisimilitude: Literally meaning similar to reality, it is a theatrical term referring to the ability of an artwork to appear real, to foster a sense of realism in the audience. Here, it refers to the ability of game soundscapes to sound real.
Chapter 8
Perceived Quality in Game Audio
Ulrich Reiter
Norwegian University of Science and Technology, Norway
ABSTRACT
This chapter reviews game audio from a Quality of Experience point of view. It describes cross-modal interaction of auditory and visual stimuli, re-introduces the concept of plausibility, and discusses issues of interactivity and attention as the basis for the qualitative, high-level salience model being suggested here. The model is substantiated by experimental results indicating that interaction or task located in the audio domain clearly influences the perceived audio quality. Cross-modal influence, with interaction or task located in a different (for example, visual) domain, is possible, but is significantly harder to predict and evaluate.
INtrODUctION Perceived quality in game audio is not a question of audio quality alone. As audio is usually only a part in an overall game concept consisting of graphics, physics, artificial intelligence, user input, feedback and so forth, audio has been considered to play a relatively minor role in the overall experience that a game provides. Consequently, a lot of effort has been put into providing near photo-realistic representations of (virtual) game scenarios to the player, but only little into audio. Interestingly, DOI: 10.4018/978-1-61692-828-5.ch008
this assessment has had to be revised over the last years. Learning from other artistic fields like cinema, in which storytelling is a central means of providing “user experience”, game developers have come to know that audio can trigger emotions and provide additional information otherwise hard to convey. Today, although budgets are still limited compared to other aspects of game engineering, audio in games is given more attention by the game developers than ever before. But there is more to audio in games than just an emotional support for a story. Most games are user-centered and non-linear, as opposed to the linear story telling of traditional, non-interactive
content presentation. Therefore, the audio has to be manipulated in real time depending on the player’s actions. Real-time processing of audio can become computationally very demanding and is a problem for complex game scenarios. This has introduced the concept of plausibility: the main goal in game audio is not to have an audio simulation as exact and close to reality as possible, but to render audio that is plausible in the game scenario, and that provides an overall quality impression that matches the other aspects of the game.
One fact well known from home cinema applications is that an improved quality in video can also increase the subjectively perceived audio quality, and that the reverse effect also exists (Beerends & De Caluwe, 1999). It is therefore worth asking whether these effects can be exploited to increase the subjectively perceived overall quality of a game without actually increasing the computational load. Instead of just rendering more details (equivalent to a higher simulation depth), focusing on those details that are actually relevant in a certain context could provide a much higher Quality of Experience (QoE) (see Farnell, 2011 for a discussion of relevancy and redundancy in procedural audio design).
The central question is, therefore, which stimuli in a game scenario are of most importance? Can information that is difficult and cost-intensive to convey in one modality be presented in another modality with less effort but similar perceptual impact? What role does interactivity play in the perception of quality? What are the technical parameters that can influence the perceived quality of a game, and which other factors exist that potentially dominate the perceptual process?
This chapter aims at identifying and discussing general quality criteria in multimedia application systems with a focus on games. These criteria comprise technical as well as human factors.
In order to understand these factors, the first section touches upon the mechanisms of human perception: well-known facts about visual and auditory perception are summarized briefly.
The second section presents a discussion of cross-modal influences, that is, interaction between auditory and visual stimuli in the perceptual apparatus, and cross-modality in general. A survey detailing the most accepted theories of how audio-visual (bimodal) perception is achieved in the human brain is given. This is far more complex than just adding the results of auditory and visual processing and is therefore worth an extended discussion. This is followed by examples of effects in bimodal perception (based on research in the fields of psychology and cognitive sciences) that can be relevant in the context of game audio.
The third section discusses the concept of auditory and audio-visual plausibility. It briefly compares the requirements for exact (room) acoustic simulations versus real-time rendering and details the resulting constraints for computer games.
The fourth section gives an overview of issues related to interactivity, such as latency, user input, and perceptual feedback. Interactivity is closely related to the generation of presence, defined as the “perceptual illusion of non-mediation”, or simply the feeling of “being there”. The concept of presence is discussed as an indirect measure for perceived quality.
The fifth section elaborates on the concept of attention. The perception of multiple streams is discussed and an introduction to the general model of the Perceptual Cycle according to Neisser is given. From this, the concepts of selective attention and divided attention are discussed and capacity limits of the human perceptual system are explained.
In the sixth section, the resulting factors (technical as well as human) are arranged to form a qualitative model describing human audio-visual perception based on the salience of stimuli. Such a model can serve as a basis for determining the QoE in games in general and specifically for game audio. Experimental results documenting inner-modal versus cross-modal effects on perceived audio quality are summarized.
Finally, a summary reviews the most important concepts leading to the salience model presented in the preceding section, and potential for further research is identified.
MECHANISMS OF HUMAN PERCEPTION
Vision (sight) and audition (hearing) are the most important human senses for playing games. In the real world, these senses provide us with information about the more remote surroundings, as opposed to taste (gustation), smell (olfaction), and touch (taction or pressure), which provide information about our immediate vicinity. Because vision and audition communicate spatial and temporal relations of objects, and because the technology needed to stimulate them is readily available on computer systems used in the home, most games stimulate only these two senses.
Visual Perception
Vision mainly serves to indicate spatial relations of objects, as the human visual system seldom responds to direct light stimulation. Rather, light is reflected by objects and thus transmits information about certain characteristics of the object. The direction of a visually perceived object corresponds directly to the position of its image on the retina, the place where the light receptors are located in the eye. At the same time, a visual stimulus occupies a position in perceptual space that is defined relative to a distance axis, as well as to the vertical and horizontal axes. In the determination of an object’s distance to the eye, there are a number of potential depth cues. These include monocular mechanisms like interposition, size, and linear perspective as well as binocular cues like convergence and disparity. All of these are usually evaluated jointly, allowing us to resolve even ambiguous situations with contradicting sensory information. All these depth cues can be exploited even when the environment is at rest. As soon as motion (of objects or of the head) is present, motion parallax takes on an important role in depth perception. Motion parallax describes the fact that the image of an object far away from the viewer moves more slowly across the retina than the image of an object at a close distance. Motion parallax also provides cues in the monocular case.
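As a rough illustration of motion parallax, the following sketch computes how fast a stationary object sweeps across the visual field of an observer who moves sideways. It assumes the simplest possible geometry, which is not taken from the chapter: the object lies directly abeam and the observer moves perpendicular to the line of sight, so that the angular speed is v/d. The function name and the numbers are purely illustrative.

```python
import math


def angular_speed_deg_per_s(observer_speed_m_s: float, distance_m: float) -> float:
    """Angular speed (degrees per second) of a stationary object that lies
    directly abeam of an observer moving perpendicular to the line of sight.
    For this geometry the angular speed is simply v / d (in radians per second).
    """
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    return math.degrees(observer_speed_m_s / distance_m)


# Hypothetical walking speed of 1.5 m/s past a near and a far object.
near = angular_speed_deg_per_s(1.5, 2.0)   # object 2 m away
far = angular_speed_deg_per_s(1.5, 50.0)   # object 50 m away
print(f"near object: {near:.1f} deg/s, far object: {far:.1f} deg/s")
```

Under these assumptions the near object moves 25 times faster across the visual field than the far one, which is the relative-speed cue that motion parallax provides.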
Auditory Perception
Auditory stimuli are perceived to be localized in space. The sound is not heard within the ear, but it is phenomenally positioned at the source of the sound. In order to localize a sound, the auditory system relies on binaural and monaural acoustic cues. Directional hearing in the horizontal plane (azimuth) is dominated by two mechanisms which exploit binaural time differences and binaural intensity differences. For sinusoidal signals, interaural time differences (ITDs, the same stimulus arriving at different times at the left and the right ear) can be interpreted by the human hearing system as directional cues from around 80 Hz up to a maximum frequency of around 1500 Hz. This maximum frequency corresponds to a wavelength of roughly the distance between the two ears. For higher frequencies, more than one wavelength fits between the two ears, making the comparison of phase information between left and right ear equivocal (Braasch, 2005). For signals with frequencies above 1500 Hz, interaural level differences (ILDs) between the two ears are the primary cues (Blauert, 2001). Regardless of the source position, ILDs are small at low frequencies. This is because the dimensions of head and pinnae (the outer ear visible on the side of the head) are small in comparison to the wavelengths at frequencies below about 1500 Hz. Therefore they do not represent any noteworthy obstacle for the propagation of sound.
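The relation between the 1500 Hz limit and the size of the head can be checked with a few lines of arithmetic. The sketch below is an illustration under stated assumptions: the speed of sound (343 m/s) and the head radius (8.75 cm) are typical textbook values rather than figures from this chapter, and the ITD is estimated with Woodworth's spherical-head approximation, which the chapter does not use itself.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed: air at roughly 20 degrees Celsius
HEAD_RADIUS_M = 0.0875      # assumed: commonly quoted average head radius


def wavelength_m(frequency_hz: float) -> float:
    """Wavelength of a sound wave in air for the given frequency."""
    return SPEED_OF_SOUND_M_S / frequency_hz


def itd_woodworth_s(azimuth_deg: float) -> float:
    """Interaural time difference (seconds) for a distant source, using
    Woodworth's rigid-sphere approximation: ITD = (a / c) * (theta + sin(theta)).
    Intended for azimuths between 0 (straight ahead) and 90 degrees (side).
    """
    theta = math.radians(azimuth_deg)
    return HEAD_RADIUS_M / SPEED_OF_SOUND_M_S * (theta + math.sin(theta))


print(f"wavelength at 1500 Hz: {wavelength_m(1500.0) * 100:.1f} cm")
for azimuth in (0, 30, 60, 90):
    print(f"azimuth {azimuth:2d} deg: ITD = {itd_woodworth_s(azimuth) * 1e6:.0f} microseconds")
```

With these assumed values the wavelength at 1500 Hz is about 23 cm, which is on the order of the acoustic path between the two ears, and the largest ITD (source directly to one side) is roughly 0.66 ms.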
Directional hearing in the vertical plane (elevation) is dominated by monaural cues. These stem from direction-dependent spectral filtering caused by reflection and diffraction at the torso, head, and pinnae. Each direction of incidence (for instance, defined in terms of azimuth and elevation) is related to a unique spectral filtering for each individual. This spectral filtering can be described by head-related transfer functions (HRTFs). In addition to providing localization of sounds in the vertical plane, these spectral cues are also essential for resolving front-back confusions (Blauert, 2001). Pulkki (2001) reports that, for elevation perception, frequencies around 6 kHz are especially important. In everyday situations, localization of sound sources seldom relies on auditory cues alone. Knowledge of the potential source of a sound (for example, airplane noises from above, or crunching shoes from below) aids in the localization process. Visual cues heavily influence the localization of sound sources.
CROSS-MODAL INTERACTION BETWEEN AUDIO AND VIDEO
Human perception in real world situations is a multi-modal, recursive process. Stimuli from different modalities usually complement each other and make the perceptual process more unequivocal. Only those stimuli that can actually be perceived by the primary receptors of sound, light, pressure and so on contribute to an overall impression (which is the result of any perceptual process). The human perceptual process, because of its complexity, cannot easily be explained in a simple block diagram without neglecting important features. A number of descriptive models exist, but these only cover certain aspects of the process, depending on the level of abstraction at which the respective model is located. Relatively little is known about the mechanisms of multi-modal processing in the human brain.
The main questions with respect to audio-visual perception are: At what level of perceptual processing do cross-modal interactions occur? And what mechanism underlies them?
Joint Processing of Audio-Visual Stimuli
As early as 1909, Brodmann suggested a division of the cerebral cortex into 52 distinct regions, based on their histological characteristics (Brodmann, 1909). These areas, today called Brodmann areas, were later associated with nervous functions. The most important areas in the audio-visual game context are Primary Visual Cortex (V1), Visual Association Cortex (V2 and V3), as well as Primary Auditory Cortex and anterior and posterior transverse temporal areas (H). This division suggests that the different modalities are related to separate regions of the brain, and that processing of stimuli is performed separately for each modality. Taking a closer look at the brain reveals that the neurons of the neocortex are arranged in six horizontal layers, parallel to the surface. The functional units of cortical activity are organized in groups of neurons. These are connected by four types of fibers, of which the association fibers are especially interesting when looking at information exchange between cortical areas. Short association fibers (called loops) connect adjacent gyri, whereas long association fibers form bundles to connect more distant gyri in the same hemisphere. These association bundles give fibers to and receive fibers from the overlying gyri along their routes. They occupy most of the space underneath the cortex. There are many such connections between different functional areas of the neocortex, so that information can be exchanged between them and true multi-modal processing can be achieved. Goldstein (2002) gives an example of a red, rolling ball entering our field of view. Locally distinct neurons are then activated by either motion, shape, or color. Subsequently, dorsal and
ventral streams are also activated. Although the involved neurons are locally distinct, we perceive one singular object, not a separate rolling motion, red color, or round shape. It is still unclear how the processing of multiple characteristics of a single object is organized. A number of theories have been suggested to explain this binding problem, and the exploration of binding in the visual system has become a heavily discussed topic. According to Goldstein (2002), the most prominent theory, suggested by Singer, Engel, Kreiter, Munk, Neuenschwander, and Roelfsema (1997), assumes that visual objects are represented by groups of neurons. These so-called cell assemblies are activated jointly, producing an oscillatory response. This way, neurons belonging to the same cell assembly can synchronize. Whenever the reaction to stimuli is synchronized, this means that the respective cortical areas are processing data coming from one single object or context. Yet, this binding-by-synchrony theory has left doubts with respect to the interpretation and processing of the synchrony code. For example, Klein, König, and Körding (2003) postulate that “many properties of the mammalian visual system can be explained as leading to optimally sparse neural responses in response to pictures of natural scenes” (p. 659). According to Goldstein (2002), many others argue that binding can be explained by (selective) attention. Attention is discussed below.
Dominance of Single Modalities
Very often the dominance of visual stimuli over other modalities is accepted naturally as a given. In fact, looking at our everyday experiences we might be inclined to accept this premise without further discussion: because “seeing is believing”, we often think that we tend to trust our eyes more than the other senses. Yet, this assessment is often due to the fact that in the real world we seldom have to face contradictions in the multi-modal
stimuli perceived by our senses. There is actually no need to consciously further evaluate the different percepts in terms of relevance, because they usually complement (and do not contradict) each other. In order to actually verify any naturally given order of significance of the perceived stimuli, it is necessary to present the human perceptual system with contradictory sensory information and see which modality generally dominates, if there is one. There have been a number of scientific efforts to explain in a perceptual relevance model how the human perceptual system weighs the different contradicting percepts. Two such models have been proposed to describe how perceptual judgments are made when signals from different modalities are conflicting. One of these models suggests that the signal that is typically most reliable dominates the competition completely in a winner-take-all fashion: the judgment is based exclusively on the dominant signal. In the context of spatial localization based on visual and auditory cues, this model is called visual capture because localization judgments are made based on visual information only. The other model suggests that perceptual judgments are based on a mixture of information originating from multiple modalities. This can be described as an optimal model of sensory integration which has been derived based on the maximum-likelihood estimation (MLE) theory. This model assumes that the percepts in the different modalities are statistically independent and that the estimate of a property under examination by a human observer has a normal distribution. In the engineering literature, the MLE model is also known as the Kalman filter (Kalman & Bucy, 1961). Battaglia, Jacobs, and Aslin (2003) report that several investigators have examined whether human adults actually combine information from multiple sensory sources in a statistically optimal manner (that is, according to the MLE model). They explain:
According to this model, a sensory source is reliable if the distribution of inferences based on that source has a relatively small variance; otherwise the source is regarded as unreliable. More-reliable sources are assigned a larger weight in a linear-cue-combination rule, and less reliable sources are assigned a smaller weight. (Battaglia et al., 2003, p. 1391)
Looking at it this way, visual capture is just a special case of the MLE model: the highly reliable percept (the visual cue) is assigned a weight of one, whereas the less reliable percept (the auditory cue) is assigned a weight of zero. Battaglia et al. (2003) describe an experiment designed to answer the question whether human observers localize events presented simultaneously in the auditory and visual domain in a way that is best predicted by the visual capture model or by the MLE model. Their report suggests that both models are partially correct and that a hybrid model may provide the best account of their subjects’ performances. As greater amounts of noise were added to the visual signal, subjects used more and more information perceived via the auditory channel, as suggested by the MLE model. Yet most notably, according to their analysis, test subjects seemed to be biased towards using visual information to a greater extent than originally predicted by the MLE model. This means that the model used in the experiments committed a systematic error by constantly underestimating the test subjects’ use of visual information (thus overestimating the use of auditory information).
Shams, Kamitani, and Shimojo (2000, 2002) describe experiments in which a visual illusion was induced by sound, resulting in the auditory cue outweighing the visual cue. They presented test subjects with flashes of light and beeps of sound: whenever a single flash of light was accompanied by multiple auditory beeps, the single flash was perceived as multiple flashes.
They conclude that this alteration of the visual percept is caused by cross-modal perceptual interactions, rather than
having cognitive, attentional, or other origins. This is especially interesting as there was no degradation in the quality of the visual percept offered, which otherwise inevitably provokes the human perceptual system to rely on other modalities. To sum up, the combined results of these experiments suggest that there is no clear, generalized bias of humans toward any of the available modalities in terms of dominance. Apparently, there is no such thing as a general dominance of visual percepts over other stimuli. Instead, whenever such a bias toward any of the available modalities exists, this seems to be highly dependent on the context. Whereas Battaglia et al. (2003) tested subjects for contradicting localization cues and found a bias toward the visual percept, Shams et al. (2000) tested subjects for temporal variations of cues and found a bias toward the auditory percept. This indicates that the human perceptual system tends to prefer those senses (give a higher weight to those percepts) that promise a higher degree of reliability or resolution for the presented perceptual problem. Whereas the horizontal resolution of the human auditory system is roughly 2 to 3 degrees for sinusoidal signals coming from a forward direction (Zwicker & Fastl, 1999), the resolution of the visual system is at least 100 times as high, about 1 minute of arc (Howard, 1982). On the other hand, the time resolution of the auditory system allows it to resolve the temporal structure of sounds as close as 2 ms (Zwicker & Fastl, 1999), whereas the human visual system can be tricked into believing in a continuously moving object when presented with only 24 sampled pictures of the continuous movement per second.
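The reliability weighting behind the MLE model discussed in this subsection can be made concrete with a small numerical sketch. It assumes, as the model does, independent cues with normally distributed errors, so that each cue is weighted by its inverse variance. The function name, the cue positions, and the standard deviations are hypothetical values chosen for illustration and are not data from the cited experiments.

```python
def mle_combine(estimates, sigmas):
    """Reliability-weighted (maximum-likelihood) combination of independent,
    normally distributed cue estimates.

    Each cue is weighted by its reliability 1 / sigma**2, normalized so that
    the weights sum to one. Returns (combined estimate, combined sigma, weights).
    """
    reliabilities = [1.0 / (s * s) for s in sigmas]
    total = sum(reliabilities)
    weights = [r / total for r in reliabilities]
    combined = sum(w * e for w, e in zip(weights, estimates))
    combined_sigma = (1.0 / total) ** 0.5
    return combined, combined_sigma, weights


# Hypothetical conflict: the visual cue says 0 degrees, the auditory cue 10 degrees.
visual_deg, auditory_deg = 0.0, 10.0

# Sharp image: vision is far more reliable (sigma 1 degree vs. 5 degrees).
clear = mle_combine([visual_deg, auditory_deg], [1.0, 5.0])
# Noisy image: both cues are equally unreliable (sigma 5 degrees each).
noisy = mle_combine([visual_deg, auditory_deg], [5.0, 5.0])

print(f"clear image: estimate {clear[0]:.2f} deg, weights {clear[2]}")
print(f"noisy image: estimate {noisy[0]:.2f} deg, weights {noisy[2]}")
```

With the sharp image the combined estimate stays close to the visual position, which approaches visual capture as the visual variance shrinks toward zero. Adding visual noise shifts the estimate toward the auditory cue, the qualitative pattern described above for Battaglia et al. (2003).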
AUDITORY AND AUDIOVISUAL PLAUSIBILITY
In classic room acoustic simulation, the time necessary to render the room audible (in other words, to perform the room acoustic simulation
itself) is often considered secondary. Instead, the (acoustic) similarity between the simulation and the real situation is considered most important. In games, this situation is reversed: the available computational power is critical, and rendering has to be performed in real time. Therefore, the concept of plausibility is applied: as long as there is no obvious contradiction between the visual and the acoustic representation of a virtual scene, the human senses merge auditory and visual impressions. Hence, it is usually possible to replace a cost-intensive geometry-based room acoustic simulation with a generic reverberation algorithm, for example, with combinations of all-pass filters and delays according to Schroeder (1962, 1970), with nested all-pass filters according to Gardner (1992), or with feedback delay line structures according to Jot and Chaigne (1991). This way, the auditory part of the presentation provides a rough sketch of the room’s characteristics, whereas the visual part complements the overall impression with an increased level of detail. As long as the information provided in the two modalities is not contradictory, there is a high chance that the player’s perceptual apparatus merges the stimuli and blends them to form a single, multi-modal representation of the scene. In general, it is debatable whether a “perfect” reproduction of the properties of a real life experience will ever be possible in a computer game at all (with the assumption that a simulation is good enough as long as there is no perceptual difference to reality detectable by the human senses in the given situation). A weaker form of this requirement applies to scenes which have no counterpart in reality: their appearance needs to be plausible in every aspect and also in the sense of perfect agreement between the cues offered by the system in the different perceptual domains. In the context of games, this requirement can be further reduced.
Because the visual representation of the scene is limited to a region in the frontal area and is not supposed to fill the field of view entirely, it suffices to require that the one part of
the virtual scene that is displayed (audio-visually) is perceived as plausible. It is thus accepted that stimuli coming from the surrounding real world (which cannot be entirely excluded in a typical computer game playing environment) might interfere with those from the virtual scene. Furthermore, the time and investment necessary to develop completely accurate auditory and visual models is as much of a limiting factor for how much detail will be rendered, as is the computational power alone. It is therefore reasonable to focus only on the most important stimuli and leave out those that would go unnoticed in a real world situation. In order to do so, it is necessary to predict what the most important stimuli or objects in the overall audio-visual percept are.
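To give an impression of how little machinery a generic reverberation algorithm of the kind mentioned in this section needs, the following sketch builds a reverberator in the spirit of Schroeder's design: feedback comb filters in parallel, followed by all-pass filters in series. It is a minimal, unoptimized illustration. The delay lengths, gains, and the 44.1 kHz sample rate are assumed values, not parameters taken from Schroeder (1962, 1970), and the function names are hypothetical.

```python
def feedback_comb(x, delay, gain):
    """Feedback comb filter: y[n] = x[n] + gain * y[n - delay]."""
    y = [0.0] * len(x)
    for n in range(len(x)):
        fed_back = y[n - delay] if n >= delay else 0.0
        y[n] = x[n] + gain * fed_back
    return y


def allpass(x, delay, gain):
    """All-pass filter: y[n] = -gain * x[n] + x[n - delay] + gain * y[n - delay]."""
    y = [0.0] * len(x)
    for n in range(len(x)):
        x_delayed = x[n - delay] if n >= delay else 0.0
        y_delayed = y[n - delay] if n >= delay else 0.0
        y[n] = -gain * x[n] + x_delayed + gain * y_delayed
    return y


def schroeder_reverb(x,
                     comb_delays=(1116, 1188, 1277, 1356),  # samples, illustrative
                     comb_gain=0.84,                        # illustrative
                     allpass_delays=(225, 556),             # samples, illustrative
                     allpass_gain=0.5):                     # illustrative
    """Parallel feedback combs (averaged), followed by all-pass filters in series."""
    wet = [0.0] * len(x)
    for delay in comb_delays:
        for n, value in enumerate(feedback_comb(x, delay, comb_gain)):
            wet[n] += value / len(comb_delays)
    for delay in allpass_delays:
        wet = allpass(wet, delay, allpass_gain)
    return wet


# Impulse response over 0.1 s at an assumed sample rate of 44.1 kHz.
impulse = [1.0] + [0.0] * 4409
tail = schroeder_reverb(impulse)
print(f"{sum(1 for v in tail if v != 0.0)} non-zero samples in the first 0.1 s")
```

The cost per output sample is a handful of multiplications and additions regardless of room geometry, which is why such structures are attractive when, as argued above, a plausible rather than an exact acoustic rendering is sufficient.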
INTERACTIVITY ISSUES AND PRESENCE
The concept of interactivity has been defined by Lee, Jin, Park, and Kang (2005) and Lee, Jeong, Park, and Ryu (2007) based on three major viewpoints: technology-oriented, communication-setting-oriented, and individual-oriented views. Here, the technology-oriented view of interactivity is adopted, which “defines interactivity as a characteristic of new technologies that makes an individual’s participation in a communication setting possible and efficient” (Lee et al., 2007). Steuer (1992) holds that interactivity is a stimulus-driven variable which is determined by the technological structure of the medium. According to Steuer, interactivity is “the extent to which users can participate in modifying the form and content of a mediated environment in real time” (p. 14), in other words, the degree to which users can influence a target environment. He identifies three factors that contribute to interactivity:
• speed (the rate at which input can be assimilated into the mediated environment)
• range (the number of possibilities for action at any given time)
• mapping (the ability of a system to map its controls to changes in the mediated environment in a natural and predictable manner).
These factors are related to technological constraints that come into play when an application is supposed to provide interactivity to the user, as is the case for computer games. These technological constraints are briefly discussed in the following subsections.
Latency
Latency is one of the main concerns in computer games. Latency in the context of interactivity can be defined as the time that elapses between a user input and the apparent reaction of the system to that input. It is closely related to Steuer’s speed factor. Latencies are introduced by individual components of the system. These components may include input devices, signal processing algorithms, device drivers, communication lines and so on. Although these components may interact in more than one way on a game platform, a system’s end-to-end latency should not vary over time, so that it remains predictable. Meehan, Razzaque, Whitton, and Brooks (2003) report a study in which they tested the perceived sense of presence (see below) for two different end-to-end latencies in a Virtual Environment (VE). The low latency was 50 ms, the high latency was 90 ms. Test subjects were presented with a relaxing environment that was switched to a threatening one and their response was observed. Meehan et al. report that subjects in the low-latency group had a higher self-reported sense of presence and a statistically higher change in heart rate between presentations of the two situations. MacKenzie and Ware (1993) conducted the first quantitative experiments with respect to effects of visual latency. Participants completed a
Fitts’ Law target acquisition task in which they had to move the mouse from a starting point to a target, with a latency of between 25 ms and 225 ms from moving the mouse to actually seeing the cursor move on the screen. The authors report that the threshold at which latency started to affect the performance was approximately 75 ms. This effect was also dependent on the difficulty of the task: the harder the task, the greater was the adverse effect caused by increased latency. Wenzel (1998, 1999, 2001) has published a number of reports about the impact of system latency on dynamic performance in virtual acoustic environments with a focus on localization of sound sources. The bottom line is that, depending on the source velocity of the audio signal itself, localization of sound sources might be impaired when total system latency (end-to-end latency) is higher than around 60 ms for audio-only presentations (Wenzel, 1998). On the other hand, error rates in an active localization task, tested on an HRTF-based reproduction system, were comparable for both low and very high latencies, suggesting that subjects were largely able to ignore latency altogether (Wenzel, 2001). Nordahl (2005) examined the impact of self-induced footstep sounds on the perception of presence and latency. Interestingly, for audio-visual feedback in a VE, the maximum sound delay that was possible without latency being perceived as such was around 50% higher than for the audio-only feedback case (mean values of 60.9 ms against 41.7 ms). Nordahl explains this as attention being focused mainly on the visual, rather than the auditory feedback in the audio-visual case. Looking at these experimental results, it is difficult to draw a general conclusion on the maximum allowed latency for computer games. Apparently, the perception of latency as such depends on the system setup itself (screen, loudspeakers/headphones, for example), on the task, and on the content that is displayed.
At the same time, measuring total system latency correctly is not a trivial task. Therefore, a general recommendation
would be to keep latency as low as possible within any such system, that is, preferably below 50 ms.
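Because end-to-end latency is the sum of the contributions of individual components, a simple budget can show how quickly a 50 ms target is used up. The sketch below is illustrative only: the component names and their latencies are hypothetical, and only the relation between buffer size, sample rate, and buffering delay is a general fact.

```python
def buffer_latency_ms(buffer_frames: int, sample_rate_hz: float) -> float:
    """Delay introduced by one audio buffer: frames / sample rate, in milliseconds."""
    return 1000.0 * buffer_frames / sample_rate_hz


def end_to_end_latency_ms(components_ms: dict) -> float:
    """Total latency as the sum of all component latencies (in milliseconds)."""
    return sum(components_ms.values())


# Hypothetical contributions for one input-to-sound path.
components = {
    "input device polling": 8.0,
    "game logic (one frame at 60 fps)": 1000.0 / 60.0,
    "audio processing": 5.0,
    "audio output buffer (512 frames at 48 kHz)": buffer_latency_ms(512, 48000.0),
}

TARGET_MS = 50.0  # the general recommendation given above
total = end_to_end_latency_ms(components)
within_target = total < TARGET_MS

for name, value in components.items():
    print(f"{name:45s} {value:6.2f} ms")
print(f"{'total':45s} {total:6.2f} ms (target below {TARGET_MS:.0f} ms: {within_target})")
```

In this assumed configuration the total is about 40 ms, so doubling the audio buffer to 1024 frames would already bring the system to the edge of the target. Real systems need to be measured rather than estimated, since, as noted above, measuring total system latency correctly is not trivial.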
Input and Perceptual Feedback
Perceptual feedback is the response that a system provides to the player’s input. In games, perceptual feedback is usually provided in the auditory and visual domains. Input provided by the player can, in the general case, consist of any kind of signal accepted by the system for controlling it: speech, gesture, haptic control, eye tracking and so forth. Input and perceptual feedback are related to Steuer’s (1992) mapping factor, whereas his range factor is related to the kind of interaction that is offered by the game. This depends strongly on the goal of the application or game itself. In a first-person shooter, players might expect a different range of interaction than in a business simulation game. Hence, both input and perceptual feedback define the degree of interactivity a game player can experience.
Presence Closely related to interactivity is presence. Larsson, Västfjäll, and Kleiner (2003) define presence in interactive audio-visual application systems or VEs “as the feeling of ‘being there’” (p. 98), and as the element that generates involvement of the user. Lombard and Ditton (1997) define presence in a broader sense as the “perceptual illusion of nonmediation” (p. 24). According to Steuer (1992), the level of interactivity (degree to which users can influence the target environment) has been found to be one of the key factors for the degree of involvement of a user. Steuer has found vividness (ability to technologically display sensory rich environments) to be the second fundamental component of presence. Along the same lines, Sheridan (1994) assumes the quality and extent of sensory information that is fed back to the user, as well as exploration and manipulation capabilities, to be crucial for the
subjective feeling of presence. Other factors have been found to be determinants for presence but these depend on the theoretical concept applied by the researcher. Ellis (1996) points out that presence may not necessarily be the ultimate goal of every interactive audio-visual application system. He holds that successful task accomplishment can be far more important than presence, especially in situations “where the medium itself is not the message” (p. 253). This is easily accepted for player-game interaction, but is also applicable to communication between players in a multi-player game environment, when players have to team up to achieve a certain goal.
ATTENTION
When confronted with an increased number of stimuli, the human perceptual apparatus will try to keep up with the processing required for the input on offer. Generally, this can be achieved using different strategies. According to Pashler (1999), all of them are usually referred to as attention. Many human activities require that information from a multitude of sources is taken in. When we attempt to monitor one stream of information, we pay attention to the source. Usually, natural scenes are multi-modal, thus providing information in more than one modality. Also, natural scenes usually provide more than one informational stream. The question is then, how is attention distributed if a multitude of information is presented in more than one stream? What role does multi-modality of the information play in computer games?
Perception of Multiple Streams
Eijkman and Vendrik (1965) conducted one of the earliest studies on the perception of bimodal stimuli. They asked test subjects to detect increments in the intensity of light and tones. The stimuli lasted one second and were presented
either separately or simultaneously. Subjects detected the increments in one modality without interference from simultaneously monitoring the other modality, and detection performance was comparable to that of monitoring only one modality. Other studies, for example, by Shiffrin and Grantham (1974) and by Gescheider, Sager, and Ruffolo (1975), also support these results for presentations of short bimodal stimuli. As the stimuli presented in the auditory and the visual modalities were not contextually related in the study of Eijkman and Vendrik (1965), they constituted what could be called separate perceptual streams. Yet, detection of increments in the duration of the same stimuli showed marked interference. This suggests that temporal judgments might be processed by the same processing system (the same cortical areas), a theory that is further supported by the findings of Shams et al. (2000, 2002) already discussed in the subsection on the dominance of single modalities. Interestingly, other studies combining auditory and visual discrimination tasks showed modest but noticeable decrements in terms of performance. This was observed when test subjects were confronted with bimodal stimuli in comparison to unimodal ones. To give an example, Tulving
and Lindsay (1967) presented test subjects with tones and patches of light. Subjects were asked to judge the intensity of either tone or light, and results were compared to the bimodal judgment of intensity of both stimuli. All of these studies characteristically involve magnitude judgments rather than categorical judgments. Therefore, the performance of test subjects in the bimodal case might have been limited by the difficulty of maintaining a standard in memory against which to judge the inputs, rather than by the influence of a second modality itself.
The Perceptual Cycle
Neisser’s model of the Perceptual Cycle describes perception as an interplay of schemata, perceptual exploration, and the stimulus environment (Farris, 2003). These elements influence each other in a continuously updated circular process (see Figure 1). Thus, Neisser’s model describes at a very abstract level how the perception of the environment is influenced by background knowledge, which in turn is updated by the perceived stimuli. In Neisser’s model, schemata represent an individual’s knowledge about the environment. Schemata are based on previous experiences and
Figure 1. The Perceptual Cycle after Neisser. (Adapted from Farris, 2003)
are located in long-term memory. Neisser attributes to them the generation of certain expectations and emotions that steer our attention in the further exploration of our environment. According to Neisser, the exploratory process consists of the transfer of sensory information (the stimulus) into short-term memory. In the exploratory process, the entirety of stimuli (the stimulus environment) is compared to the schemata already known. Recognized stimuli are given a meaning, whereas unrecognized stimuli modify the schemata, which then in turn direct the exploratory process further (Goldstein, 2002; Farris, 2003). Returning to the area of games, the differences in schemata between individuals cause the same stimulus to provoke different reactions in different game players. Following Neisser’s model, new experiences (those that cause a modification of existing schemata) are especially likely to generate a higher processing load. Schemata therefore also control the attention that we pay to stimuli. The exploratory process is directed in the same way for multi-modal stimuli as for unimodal stimuli.
Selective Attention
A very large number of studies have tried to identify and describe the strategies that are actually used in the human perceptual process. Pashler (1999) gives an overview and identifies two main concepts of attention: attention based on exclusion (gating) and attention based on capacity (resource) allocation. The first concept defines attention as the mechanism that reduces the processing of irrelevant stimuli. It can be regarded as a filtering device that keeps stimuli out of the perceptual machinery that performs the recognition. Attention is therefore identified with a purely exclusionary mechanism. The second concept construes the limited processing resource (rather than the filtering device) as attention. It suggests that when attention is given
to an object, it is perceptually analyzed. When attention is allocated to several objects, they are processed in parallel until the capacity limits are exceeded. In that case, processing becomes less efficient or eventually impossible. Neither of the two concepts can be ruled out on the basis of the many investigations performed so far. Instead, all empirical results can be explained in some way under either the gating or the resource interpretation. As a result, it must be concluded that both capacity limits and perceptual gating characterize human perceptual processing. This combined concept is termed controlled parallel processing (CPP). It claims that parallel processing of different objects is achievable but optional. At the same time, selective processing of a single object is possible, largely preventing other stimuli from undergoing full perceptual analysis. In fact, further conceptualizing attention might not even be possible unless we understand in detail the neural circuitry and operations that underlie these processes. In the context of bimodal perception, the more interesting question is whether there are separate perceptual attention systems associated with different sensory modalities or whether a unified multi-modal attention system exists. Are visual and auditory attention the same thing? According to Pashler (1999), investigations have shown that humans are capable of selecting visual stimuli in one location in space and auditory stimuli in another. Spence, Nicholls, and Driver (2001) examined the effect of expecting a stimulus in a certain modality upon human performance. They measured the reaction time to a stimulus in the auditory, visual, or tactile modality under different frequencies of occurrence (an equal number of targets in all modalities versus a 75% majority of targets in one modality). Spence et al.
report that reaction times for targets in the unexpected modalities were slower than for the expected modality or with no expectancy at all. They further state that shifting attention away from the tactile modality took longer than shifting from the auditory or visual modality. These results show that performance depends not only on what actually happens, but also on what is anticipated by a game player. Yet, it must also be noted that in this study a faster response time for the most likely modality was always related to priming from an event in the same modality on the previous trial, and not to the expectancy as such. Alais and Blake (1999) found evidence that attention focused on a visual object markedly amplifies neural activity produced by features of the attended object. Drawing on single-cell and neuroimaging studies, they reinforce that visual attention modulates neural activity in several areas within the visual cortex. They state that “attentional modulation seems to involve a boost in the gain of responses of cells to their preferred stimuli, not a sharpening of their stimulus selectivity” (p. 1015). These findings clearly indicate that the perceptual process is actually controlled by attention. They cannot fully answer the question of whether there is one multi-modal attention or whether separate attentions are associated with the individual modalities. However, there are indicators that favor the latter.
Divided Attention and Perceptual Capacity Limits
One of these indicators is that capacity limits appear to be more severe when multiple stimuli are presented in the same modality than when they are spread across multiple modalities (Pashler, 1999; Reiter, Weitzel, & Cao, 2007; Reiter & Weitzel, 2007; Reiter, 2009). This means that capacity limits may be reached earlier and more frequently if the main task and the so-called distractors (stimuli that are not directly related to the task or the direct focus of attention) are located in the same modality. In an overview article, Lavie (2001) examines the capacity limits in selective attention. Lavie summarizes what evidence from several studies suggests: that selective attention as discussed in the previous section can result either in selective perception (concept of gating or early selection) or in selective behavior (resource allocation or late selection). Most importantly, she argues that the choice of mechanism actually applied depends on the perceptual load. At low perceptual load, irrelevant information continues to be processed: early selection fails and late selection becomes necessary. When the perceptual load is high, irrelevant information is not processed and resource allocation is no longer needed. She cites a number of experimental studies that support these conclusions: processing of distractors ceases when the perceptual capacity is exhausted. Interestingly, Lavie claims that distractor processing depends on perceptual capacity limits, rather than on limited information contained in the relevant stimuli. This makes the MLE model of secondary importance: in the MLE model, limited information contained in the relevant stimuli should entail the processing of additional cues among the distractors to check the reliability of that limited information and the correctness of its interpretation. Following Lavie, this is either not possible when the perceptual load is high, or attention needs to be shifted to formerly irrelevant information.
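The load dependence described above can be illustrated with a deliberately simple calculation. The following is a toy sketch and not Lavie's formal account: it assumes that capacity and load can be expressed as unit-free numbers and that the relevant task is always served first. The function name `distractor_processing` and all values are hypothetical.

```python
def distractor_processing(capacity: float, task_load: float,
                          distractor_demand: float) -> float:
    """Return how much distractor processing takes place in this toy model.

    Assumptions (not taken from Lavie, 2001): capacity and load are
    unit-free numbers, and the relevant task is always served first.
    """
    # Capacity that is left over once the relevant task has been served.
    spare = max(0.0, capacity - task_load)
    # Distractors only receive whatever capacity spills over.
    return min(distractor_demand, spare)


# Low perceptual load: spare capacity spills over to the distractor.
low_load = distractor_processing(capacity=1.0, task_load=0.3,
                                 distractor_demand=0.5)
# High perceptual load: capacity is exhausted, the distractor is ignored.
high_load = distractor_processing(capacity=1.0, task_load=1.0,
                                  distractor_demand=0.5)
print(low_load, high_load)
```

In this sketch the distractor is fully processed under low load and not processed at all under high load, which mirrors the qualitative claim only. It says nothing about how load would be measured in an actual game.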
PERCEPTUAL SALIENCE AND SALIENCE MODEL
Landragin, Bellalem, and Romary (2001) suggest that in the absence of information about the history of an interactive process, a (visual) object can be considered salient when it attracts the user’s visual attention more than the other objects. This definition of salience, originally valid for the visual domain, can easily be extended to what might be called multi-modal salience, meaning that:
• certain properties of an object attract the user’s general attention more than the other properties of that object
• certain objects attract the user’s attention more than other objects in that scene.
A salience model in the game context requires a user model of perception, as well as a task model. The user model describes the familiarity of the game player with the objects’ properties, as the attention paid to the properties of an object may vary with the background and experience of the player. Whereas an avatar of a human being or a human speech utterance can be considered more or less equally salient to all players (because its significance to humans is embedded genetically), an acoustically trained person might focus more on the reverberation in a virtual room than a visually oriented person. The task model describes the fact that salience depends on intentionality: depending on the task a player is given, the player’s focus will shift accordingly. Salience also depends on the physical characteristics of the objects themselves. In the auditory domain it is known that certain noises with increased values of properties like sharpness or roughness attract attention more than others (Zwicker & Fastl, 1999). In addition, salience can be due to the spatial or temporal disposition of the objects in a scene. One of the most interesting aspects of a salience model in the context of computer games is its dependency on the degree of interactivity that the game offers to the player. If the player is allowed to interact freely with the objects in a virtual scene, then it is quite easy to determine the player’s focus. Obviously, the player’s focus will be on the object currently being manipulated, so there is a clear indication of where to create a higher agreement of modalities. Consequently, games with fewer interaction possibilities are less likely to provide a sense of being there to the player. Thus, interactivity is important for the perceived realism of games in two different ways: first, it allows the player to do something in the virtual world, and second, it allows the application to determine the player’s momentary focus. This information can then be used to enhance the audio-visual appearance of the object in focus, for instance, by making the sound (effects) related to that object more realistic in terms of acoustic details, frequency range, localization, and so on.
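How an application might use knowledge of the player's focus can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any existing game engine: the class `SceneObject`, the function `assign_audio_detail`, the two detail levels, and the numeric salience scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str
    salience: float  # hypothetical score between 0 and 1


def assign_audio_detail(objects, focused_name=None, budget=2):
    """Map each object name to an audio detail level ("high" or "low").

    If the application knows which object the player is manipulating,
    that object is served first. The remaining high-detail slots go to
    the objects with the highest (assumed) salience scores.
    """
    ranked = sorted(
        objects,
        # False sorts before True, so the focused object comes first;
        # ties are broken by descending salience.
        key=lambda o: (o.name != focused_name, -o.salience),
    )
    high = {o.name for o in ranked[:budget]}
    return {o.name: ("high" if o.name in high else "low") for o in objects}


scene = [
    SceneObject("door", 0.2),
    SceneObject("radio", 0.6),
    SceneObject("npc_voice", 0.9),
    SceneObject("fountain", 0.4),
]
levels = assign_audio_detail(scene, focused_name="door", budget=2)
print(levels)
```

With the door in focus, it receives high audio detail although its salience score is the lowest, and the one remaining slot goes to the most salient other object. Without a known focus (`focused_name=None`), the ranking falls back to the salience scores alone.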
Salience Model
Obviously, there are situations in which the game engine has no or only limited information about the player’s current focus. In these cases, it appears useful to have a salience model classifying the objects contained in the game scene. No such generalized multi-modal salience model exists yet. For the rather limited scope of a gaming situation, a qualitative salience model is suggested here. The salience model comprises the influence factors that control the level of salience of the objects in a game scene. Figure 2 shows how such a salience model may be structured: it is reasonable to start from the basis of human perception, the stimuli. In games, stimuli are generated by the game system itself, so they depend on a number of factors, the influence factors of level 1. These comprise the audio and visual reproduction setups, as well as input devices used for player feedback to (and control of) the system, like keyboard, joystick, mouse, or any other dedicated input device. Influence factors of level 1 are those related to the generation and control of stimuli. The core elements of human perception are sensory perception on the one hand and cognitive processing on the other. Sensory perception can be affected by a number of influence factors of level 2. These involve the physiology of the user (acuity of vision and hearing, masking effects caused by limited resolution of the human sensors, and so on), as well as other factors directly related to the physical perception of stimuli. Cognitive processing produces a response by the player. This response can be obvious, like an immediate reaction to a stimulus, or it can be an internal response like re-distributing attention, shifting focus, or just entering another turn of
Figure 2. A salience model for perceived quality in audio-visual games
the Perceptual Cycle. Obviously, the response is governed by another set of influence factors, those of level 3. These span the widest range of factors and are also the most difficult to quantify: experience, expectations, and socio-cultural background of the player; difficulty of the task in a specific game situation; degree of interactivity; and so forth. Influence factors of level 3 are related to the processing and interpretation of the perceived stimuli. Cognitive processing will eventually lead to a certain quality impression that is a function of all influence factors of levels 1–3. This quality impression cannot be directly quantified. It needs additional processing to be expressed, for example in the form of ratings on a quality (or quality impairment) scale or as semantic descriptors. The overall quality impression is, in turn, the result of evaluating single or combined quality attributes. For example, Woszczyk, Bech, and Hansen (1995) have developed a number of attributes that are believed to be relevant for an overall audio-visual quality impression: they organize these attributes (quality, magnitude, involvement, balance) into 4 dimensions of perception (motion, mood, space, action), resulting in a 4 by 4 matrix of quality criteria. Yet, a quantification of their impact is hardly possible as of now. This is because the
individual attribute’s weight not only depends on the audio-visual game scene under assessment (the stimuli), but also on the experimental methodology itself. An attribute that is explicitly asked for might be assumed to be of higher importance by a test player (we know from our experience that only important things are asked for in any kind of test). The player’s attention will be directed toward the attribute under assessment, which distorts unbiased perception of the audio-visual scene as a whole. Therefore, the player’s reaction in terms of quality rating can be assumed to be influenced as well.
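The 4 by 4 matrix of quality criteria mentioned above can be written down as a simple data structure. The sketch below only enumerates the sixteen criteria and stores ratings for them. It deliberately contains no weighted sum, because the weights of the individual attributes cannot be quantified at present. The helper names `record_rating` and `unrated` and the example rating are hypothetical.

```python
ATTRIBUTES = ("quality", "magnitude", "involvement", "balance")
DIMENSIONS = ("motion", "mood", "space", "action")

# One cell per (dimension, attribute) pair; None means "not rated yet".
criteria = {(d, a): None for d in DIMENSIONS for a in ATTRIBUTES}


def record_rating(matrix, dimension, attribute, rating):
    """Store a rating for one criterion, rejecting unknown criteria."""
    if (dimension, attribute) not in matrix:
        raise KeyError(f"unknown criterion: {dimension}/{attribute}")
    matrix[(dimension, attribute)] = rating


def unrated(matrix):
    """Return the criteria that have not received a rating."""
    return [key for key, value in matrix.items() if value is None]


# Illustrative rating on an assumed 5-level scale.
record_rating(criteria, "space", "quality", 4)
print(len(criteria), len(unrated(criteria)))
```

Keeping the criteria separate in this way makes it explicit which attribute a test player was asked about, which matters because asking about an attribute directs attention toward it.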
Experimental Results
A number of experiments have shown that player interaction with an audio-visual game might have an effect on the perceived overall quality (Jumisko-Pyykkö, Reiter, & Weigel, 2007; Reiter et al., 2007; Reiter & Weitzel, 2007; Reiter & Jumisko-Pyykkö, 2007; Reiter, 2009). In these experiments, the general assumption was that by offering attractive interactive content, or by assigning the user a challenging task, the user would become more involved and thus experience a subjectively higher overall quality. Along the same lines, it was hypothesized that the subject’s ability to differentiate between different levels of quality would decrease with an increase in difficulty of task or degree of interaction. The results show that this is not generally the case. However, when both the task and the main varying quality attribute were located in the same modality, such an effect could be observed. More specifically, in the first experiment (Jumisko-Pyykkö et al., 2007; Reiter & Jumisko-Pyykkö, 2007) subjects were presented with a scenario located in a virtual sports gym. In the center of the gym, a loudspeaker was positioned that played back music/speech signals with varying amounts of reverberation (time and strength). Subjects were asked to rate the quality of reverberation under three different degrees of interaction:
1. No interaction (watch task): subjects were automatically moved on a pre-defined motion path through the virtual scenario.
2. Limited interaction (watch and press button task): subjects were moved on a pre-defined motion path through the virtual scenario, but were asked to press a button whenever a certain object appeared within their field of view.
3. Full interaction (navigate and collect task): subjects were asked to move freely through the scenario by using the computer mouse and to collect as many objects as possible by approaching them.
Interestingly, the ability of subjects to rate the quality of reverberation correctly did not vary with the degree of interaction/difficulty of the task (Friedman χ² = 3.3, df = 2, p > 0.05, ns). Although subjects claimed to have experienced more difficulties in the interactive tasks, this did not show in the statistical analysis of the collected data. Three possible explanations were considered. The first was that the quality differences were too obvious, that is, the steps between the different amounts of reverberation were too big. This is
possible but was not regarded as probable, given the results of informal experiments with a similar variation in reverberation. The second was that the tasks (pressing a button, and navigating/collecting objects) were not demanding enough and that it was too easy for subjects to dedicate part of their attention to the quality-rating task. This was contradicted by the claims of the subjects themselves: a large majority claimed to have been distracted by the navigation task. The third possible explanation was subsequently examined in further experiments: the additional cognitive load (pressing a button, navigating while collecting objects) was located in the visual and haptic domains, whereas the quality differences to be rated were located in the auditory domain. In a second round of experiments (compare Reiter et al., 2007; Reiter, 2009), both the additional cognitive load and the quality variations were located in the auditory domain. A virtual room (a replica of the entrance hall of a large university building) was equipped with a virtual loudspeaker in the center, and subjects were asked to navigate freely through the room using a computer mouse. The loudspeaker played back a randomized sequence of the numbers 1 to 4 read out loud. The reverberation time of the room acoustic simulation could be adjusted between 1.0 s and 3.0 s in 0.5 s steps, with 2.0 s considered the “reference” reverberation time. In the experiment, the reverberation time was changed from the reference to another value at a single random point in time within a transition window beginning 5 seconds after the start and ending 5 seconds before the end of each 30-second trial. A modified Degradation Category Rating scale according to Recommendation ITU-T P.911 (1998) was used, consisting of 5 levels (much shorter, shorter, equal, longer, much longer), to have subjects compare the test reverberation time with the reference reverberation time.
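The trial structure described above can be sketched in a few lines. Several details are assumptions made for illustration only and are not taken from the original study: that the change time is drawn uniformly from the transition window, that the test value is drawn uniformly from the four non-reference values, and that the five scale labels map one-to-one onto the five reverberation times. The names `make_trial` and `correct_label` are hypothetical.

```python
import random

REFERENCE_RT = 2.0                       # reference reverberation time in s
RT_VALUES = [1.0, 1.5, 2.0, 2.5, 3.0]    # 1.0 s to 3.0 s in 0.5 s steps
SCALE = ["much shorter", "shorter", "equal", "longer", "much longer"]
TRIAL_LENGTH = 30.0                      # seconds per trial
GUARD = 5.0                              # no change in first/last 5 s


def correct_label(test_rt):
    """Assumed one-to-one mapping from reverberation time to scale label."""
    return SCALE[RT_VALUES.index(test_rt)]


def make_trial(rng):
    """Draw one trial: a test value and the time at which the change occurs."""
    # Assumption: the test value always differs from the reference.
    test_rt = rng.choice([v for v in RT_VALUES if v != REFERENCE_RT])
    # Assumption: uniform distribution within the transition window.
    change_at = rng.uniform(GUARD, TRIAL_LENGTH - GUARD)
    return {"test_rt": test_rt,
            "change_at": change_at,
            "correct": correct_label(test_rt)}


rng = random.Random(1)  # fixed seed so that a session can be reproduced
for trial in (make_trial(rng) for _ in range(3)):
    print(trial)
```

A seeded generator is used here so that the same sequence of trials can be replayed. Whether the original experiment randomized in this way is not stated in the text.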
The additional cognitive load consisted of a so-called n-back working memory task, similar to what has been introduced by Kirchner (1958). Here, subjects were asked to semantically compare
Figure 3. Presented stimuli and correct answers (“Comparison”) for 1-back and 2-back continuous-matching tasks
the current stimulus (the current number) with the one presented n steps back (see Figure 3). In the experiment, n was varied between 0 (no additional load) and 2 (high additional load). The hypothesis was, again, that with increasing difficulty of the task, subjects would commit more errors in rating the reverberation time as a measure of perceived quality. For the statistical analysis, the rating errors were weighted according to their size, such that each 0.5 s of deviation resulted in one error point. The subsequent analysis was performed on error points. A complete description of the experiment can be found in Reiter (2009, pp. 203–212). A comparison of the error rates for “navigation only” with “navigation with 2-back task” resulted in a highly significant difference (T = 20, p ≤ 0.01). Comparing these results to the first experiment described above, it becomes apparent that the inner-modal influence of the task is significantly greater than the cross-modal influence. This might indicate that humans perform a pre-processing of stimuli that, depending on modality, takes place in separate areas of the brain. Thus, in situations where stimuli that belong to different modalities have to be processed at the same time, we are better able to parallelize and distribute the processing accordingly. This is also suggested by the common theories of capacity limits in human attention (see above).
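Both the n-back comparison and the error-point scoring described above are simple enough to be written out. The following sketch assumes that a rating is expressed as the reverberation time it corresponds to. The function names and the example number sequence are hypothetical and do not reproduce the stimuli of the original experiment.

```python
def n_back_answers(stimuli, n):
    """Return the correct match/no-match answer for every stimulus.

    The first n stimuli have nothing to be compared with, so their
    answer is None. n = 0 means "no additional task" and is rejected.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return [
        stimuli[i] == stimuli[i - n] if i >= n else None
        for i in range(len(stimuli))
    ]


def error_points(rated_rt, true_rt, step=0.5):
    """One error point per 0.5 s deviation between rating and truth."""
    return round(abs(rated_rt - true_rt) / step)


numbers = [3, 1, 3, 3, 2, 3]             # illustrative sequence of 1..4
one_back = n_back_answers(numbers, 1)
two_back = n_back_answers(numbers, 2)
print(one_back)
print(two_back)
print(error_points(rated_rt=3.0, true_rt=2.0))
```

For the example sequence, the 1-back task has a single match (the second consecutive 3), whereas the 2-back task has two matches. A rating that is off by 1.0 s yields two error points.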
Game Example
In a third experiment (Reiter & Weitzel, 2007), inspired by Zielinski, Rumsey, Bech, de Bruyn, and Kassier (2003) and Kassier, Zielinski, and Rumsey (2003), it was shown that a cross-modal influence of interaction is indeed possible when stimuli and interaction/task are carefully balanced. For this, a simplified Space Invaders-like arcade game was created, in which two different types of objects (donuts and snowballs) moved through a virtual room. Objects moved in a straight line towards the baseline, along which the player could move left and right. Players were instructed to collect as many donuts as possible and to avoid collisions with snowballs. Each collected donut increased the player’s score, whereas a collision with a snowball decreased it. The current game score was displayed
Figure 4. Grey-scale screenshot of the game scenario
on the screen near the chimney, which served as the source of the flying objects. Figure 4 shows a screenshot of the game scenario. A typical background music track for an arcade game was chosen for the game. For the experiment, each subject carried out a passive and an active session. The active session involved playing the computer game and evaluating the sound quality of the game music. This session was designed to cause a division of attention between evaluating the audio quality and reaching a high score. In the passive session, subjects were asked to evaluate the audio quality while a game demo was presented. Here, the attention of the subjects was assumed to be directed to the audio quality exclusively. In both sessions, active and passive, either the original (20 kHz) game music or a low-pass filtered version with a cut-off frequency of fc = 11 kHz, 12 kHz, or 13 kHz was played. This was complemented by an anchor with fc = 4 kHz. Thus, a total of 5 items (3 test items, 1 anchor item, and 1 reference item corresponding to the original full-range signal) were presented to the players in the experiment. After each round of the game, players were asked to rate the perceived tonal quality degradation using the standardized 5-level impairment scale of ITU-T P.911 (Recommendation ITU-T P.911, 1999). A total of 32 subjects participated in the experiment. Seven players were female and 25 were male (age M = 25.7, SD = 5.36). Regarding their listening experience, 20 subjects were considered initiated assessors and 12 were classified as naive assessors. The initiated assessors had already gained skills and knowledge in rating audio quality in preceding unimodal and bimodal subjective assessments. All participants reported normal hearing and normal or corrected-to-normal visual acuity. A Wilcoxon T test showed that the quality ratings of the active session differed significantly from the ratings of the passive session for cut-off frequencies up to 12 kHz. A significant decrease in rating correctness was shown for the active session in comparison to the passive session for the anchor item (T = 37, p ≤ 0.01), the cut-off frequency fc = 11 kHz (T = 452.50, p ≤ 0.01), and the cut-off frequency fc = 12 kHz (T = 812, p ≤ 0.01). For the cut-off frequency of 13 kHz and the
reference item, no significant differences could be found (T = 630.50 and T = 75, respectively; p > 0.05, ns). The data analysis showed that the ratings of the tonal quality degradations in the active session differed significantly from those in the passive session. The low-pass filtering was rated as less perceptible in the active session than in the passive session, in which active players turned into passive viewers. More generally speaking, the experiment shows that an influence of interaction performed in one modality (visual-haptic) upon the perception of quality in another modality (in this case, auditory) is possible. In order for such a cross-modal influence to exist, the characteristics of stimuli and interaction/task must be carefully balanced. At this time, it is not possible to determine or quantify that balance a priori. However, some of the influence factors that contribute to this balance have been identified in the salience model in Figure 2 above. These influence factors need to be quantified, and this is a task for the future.
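The T values reported above are Wilcoxon signed-rank statistics for paired ratings. For readers unfamiliar with the statistic, the following sketch shows one common way of computing it: zero differences are dropped, tied absolute differences receive their average rank, and T is the smaller of the two signed rank sums. Whether the original analysis handled zeros and ties in exactly this way is not stated, so this is an assumption. The ratings in the example are invented for illustration and are not data from the experiment.

```python
def wilcoxon_t(active, passive):
    """Wilcoxon signed-rank statistic T for two paired rating lists."""
    # Paired differences; pairs without a difference are dropped.
    diffs = [a - p for a, p in zip(active, passive) if a != p]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        # Find the run of tied absolute differences starting at i.
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        # Average of the 1-based ranks i+1 .. j+1 for the tied run.
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return min(w_plus, w_minus)


# Invented ratings on a 5-level impairment scale (5 = imperceptible).
active_ratings = [4, 5, 4, 3, 5, 4]
passive_ratings = [2, 3, 4, 2, 2, 5]
print(wilcoxon_t(active_ratings, passive_ratings))
```

In the invented example, the active ratings are mostly higher than the passive ones, so the smaller rank sum belongs to the single negative difference. Deriving a p value from T additionally requires tables or a normal approximation, which is omitted here.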
SUMMARY AND CONCLUSION
This chapter has reviewed some of the most important issues of perceived quality of audio in computer games. The main conclusion is that audio quality in games, as perceived by a game player, is not independent of other factors (apart from sound quality itself). Because games usually provide information and feedback to the player in more than the auditory modality, it is necessary also to take into account other modalities when judging the impact and quality of audio. A rating of audio quality alone, without the gameplay context, is not meaningful. The physical mechanisms of human auditory and visual perception are well understood. Cross-modal interaction between the two domains, that is, perceptual processing in the human brain, needs further research before it is possible to model such processes. Still, whether it is possible to come up with a generalized model of cross-modal perceptual processing at all is highly questionable. Many assume that its complexity far exceeds what a suitable model could capture. Yet, it seems feasible to aim at perceptual models that are valid for certain perceptual scenarios only. A specific game-playing scenario can be one of these, as factors like setup (computer screen, loudspeakers/headphones, input devices) and task vary only a little across users, given a certain use case. This has been demonstrated in the game example above. A salience model as described in this contribution could therefore serve as a starting point for the exploitation of salience effects. Salience is closely related to distribution of attention and perceptual capacity limits. The experimental results summarized in this chapter indicate that effects of capacity limits are more dominant inner-modally than cross-modally. At the same time, capacity limits seem to be more predictable inner-modally than cross-modally. Until we have better models of the perceptual processing underlying the generation of a subjective quality impression, it will be difficult to predict the perceived quality of audio in a multi-modal context in general, or in a game context as discussed here. Nevertheless, both the experiments described and the literature and effects reviewed here suggest that there is potential for exploitation of such perceptual constraints. Future research should therefore concentrate on methodologies for the subjective evaluation of audio-visual quality, or multi-modal quality in general. Only a few recommendations exist for performing audio-visual experiments, and the impact of interactivity (as naturally given in any gameplay) on the perceived quality has so far not been considered at all.
Once proper recommendations exist, it will be much easier to compare and validate experimental results, thus paving the way for a quantification of the salience model described in this chapter.
REFERENCES
Alais, D., & Blake, R. (1999). Neural strength of visual attention gauged by motion adaptation. Nature Neuroscience, 2(11), 1015–1018. doi:10.1038/14814
Battaglia, P. W., Jacobs, R. A., & Aslin, R. N. (2003). Bayesian integration of visual and auditory signals for spatial localization. Journal of the Optical Society of America, 20(7), 1391–1397. doi:10.1364/JOSAA.20.001391
Beerends, J. G., & De Caluwe, F. E. (1999). The influence of video quality on perceived audio quality and vice versa. Journal of the Audio Engineering Society, 47(5), 355–362.
Blauert, J. (2001). Spatial hearing: The psychophysics of human sound localization (3rd ed.). Cambridge, MA: MIT Press.
Braasch, J. (2005). Modelling of binaural hearing. In Blauert, J. (Ed.), Communication acoustics (pp. 75–108). Berlin: Springer Verlag. doi:10.1007/3540-27437-5_4
Brodmann, K. (1909). Vergleichende Lokalisationslehre der Grosshirnrinde in ihren Prinzipien dargestellt auf Grund des Zellenbaues. Leipzig, Germany: Johann Ambrosius Barth Verlag.
Eijkman, E., & Vendrik, J. H. (1965). Can a sensory system be specified by its internal noise? The Journal of the Acoustical Society of America, 37, 1102–1109. doi:10.1121/1.1909530
Ellis, S. R. (1996). Presence of mind... A reaction to Thomas Sheridan’s “Musing on telepresence.” Presence (Cambridge, Mass.), 5, 247–259.
Farnell, A. (2011). Behaviour, structure and causality in procedural audio. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Farris, J. S. (2003). The human interaction cycle: A proposed and tested framework of perception, cognition, and action on the web. Unpublished doctoral dissertation. Kansas State University, USA.
Gardner, W. G. (1992, November). A realtime multichannel room simulator. Paper presented at the 124th meeting of the Acoustical Society of America.
Gescheider, G. A., Sager, L. C., & Ruffolo, L. J. (1975). Simultaneous auditory and tactile information processing. Perception & Psychophysics, 18, 209–216.
Goldstein, E. B. (2002). Wahrnehmungspsychologie (2nd ed.). Berlin: Spektrum Akadem. Verlag.
Howard, I. P. (1982). Human visual orientation. New York: Wiley.
Jot, J. M., & Chaigne, A. (1991). Digital delay networks for designing artificial reverberators. Paper presented at the AES 90th Convention. Preprint 3030.
Jumisko-Pyykkö, S., Reiter, U., & Weigel, C. (2007). Produced quality is not perceived quality - A qualitative approach to overall audiovisual quality. In Proceedings of the 3DTV Conference.
Kalman, R. E., & Bucy, R. S. (1961). New results in linear filtering and prediction problems. Journal of Basic Engineering, 83, 95–108.
Kassier, R., Zielinski, S., & Rumsey, F. (2003). Computer games and multichannel audio quality part 2 - Evaluation of time-variant audio degradation under divided and undivided attention. AES 115th Convention. Preprint 5856.
Kirchner, W. K. (1958). Age differences in short-term retention of rapidly changing information. Journal of Experimental Psychology, 55(4), 352–358. doi:10.1037/h0043688
Klein, D. J., König, P., & Körding, K. P. (2003). Sparse spectrotemporal coding of sounds. EURASIP Journal on Applied Signal Processing, 7, 659–667. doi:10.1155/S1110865703303051
Landragin, F., Bellalem, N., & Romary, L. (2001). Visual salience and perceptual grouping in multimodal interactivity. In Proceedings of International Workshop on Information Presentation and Natural Multimodal Dialogue IPNMD.
Larsson, P., Västfjäll, D., & Kleiner, M. (2003). On the quality of experience: A multi-modal approach to perceptual ego-motion and sensed presence in virtual environments. In Proceedings of First ISCA ITRW on Auditory Quality of Systems AQS-2003, 97-100.
Lavie, N. (2001). Capacity limits in selective attention: Behavioral evidence and implications for neural activity. In Braun, J., & Koch, C. (Eds.), Visual attention and cortical circuits (pp. 49–60). Cambridge, MA: MIT Press.
Lee, K. M., Jeong, E. J., Park, N., & Ryu, S. (2007). Effects of networked interactivity in educational games: Mediating effects of social presence. In Proceedings of PRESENCE2007, 10th Annual International Workshop on Presence, 179-186.
Lee, K. M., Jin, S. A., Park, N., & Kang, S. (2005). Effects of narrative on feelings of presence in computer/video games. In Proceedings of the Annual Conference of the International Communication Association (ICA).
Lombard, M., & Ditton, Th. (1997). At the heart of it all: The concept of presence. Journal of Computer-Mediated Communication, 3.
MacKenzie, I. S., & Ware, C. (1993). Lag as a determinant of human performance in interactive systems. In Proceedings of the ACM Conference on Human Factors in Computing Systems – INTERCHI’93, 488-493.
Meehan, M., Razzaque, S., Whitton, M. C., & Brooks, F. P., Jr. (2003). Effect of latency on presence in stressful virtual environments. In Proceedings of IEEE Virtual Reality, 141-148.
Nordahl, R. (2005). Self-induced footsteps sounds in virtual reality: Latency, recognition, quality and presence. In Proceedings of PRESENCE 2005, 8th Annual International Workshop on Presence, 353-354.
Pashler, H. E. (1999). The psychology of attention. Cambridge, MA: MIT Press.
Pulkki, V. (2001). Spatial sound generation and perception by amplitude panning techniques. Unpublished doctoral dissertation. Helsinki University of Technology, Finland.
Recommendation ITU-T P.911. (1998/1999). Subjective audiovisual quality assessment methods for multimedia applications. Geneva: International Telecommunication Union.
Reiter, U. (2009). Bimodal audiovisual perception in interactive application systems of moderate complexity. Unpublished doctoral dissertation. TU Ilmenau, Germany.
Reiter, U., & Jumisko-Pyykkö, S. (2007). Watch, press and catch - Impact of divided attention on requirements of audiovisual quality. In Jacko, J. (Ed.), Human-Computer Interaction, Part III, HCI 2007 (pp. 943–952). Berlin: Springer Verlag.
Reiter, U., & Weitzel, M. (2007). Influence of interaction on perceived quality in audiovisual applications: Evaluation of cross-modal influence. In Proceedings of 13th International Conference on Auditory Displays (ICAD), 380-385.
Reiter, U., Weitzel, M., & Cao, S. (2007). Influence of interaction on perceived quality in audio visual applications: Subjective assessment with n-back working memory task. In Proceedings of AES 30th International Conference.
Perceived Quality in Game Audio
Schroeder, M. R. (1962). Natural sounding artificial reverberation. Journal of the Audio Engineering Society. Audio Engineering Society, 10(3), 219–223. Schroeder, M. R. (1970). Digital simulation of sound transmission in reverberant spaces (part 1). The Journal of the Acoustical Society of America, 47(2), 424–431. doi:10.1121/1.1911541 Shams, L., Kamitani, Y., & Shimojo, S. (2000). What you see is what you hear. Nature, 408, 788. doi:10.1038/35048669 Shams, L., Kamitani, Y., & Shimojo, S. (2002). Visual illusion induced by sound. Brain Research. Cognitive Brain Research, 14, 147–152. doi:10.1016/S0926-6410(02)00069-1 Sheridan, T. B. (1994). Further Musings on the Psychophysics of Presence. Presence (Cambridge, Mass.), 5, 241–246. Shiffrin, R. M., & Grantham, D. W. (1974). Can attention be allocated to sensory modalities? Perception & Psychophysics, 15, 460–474. Singer, W., Engel, A. K., Kreiter, A. K., Munk, M. H. J., Neuenschwander, S., & Roelfsema, P. R. (1997). Neuronal assemblies: necessity, signature and detectability. Trends in Cognitive Sciences, 1(7), 252–261. doi:10.1016/S13646613(97)01079-6 Spence, C., Nicholls, M. E. R., & Driver, J. (2001). The cost of expecting events in the wrong sensory modality. Perception & Psychophysics, 63(2), 330–336. Steuer, J. (1992). Defining virtual reality: Dimensions determining telepresence. The Journal of Communication, 42(4), 73–93. doi:10.1111/j.1460-2466.1992.tb00812.x Tulving, E., & Lindsay, P. H. (1967). Identification of simultaneously presented simple visual and auditory stimuli. Acta Psychologica, 27, 101–109. doi:10.1016/0001-6918(67)90050-9
Wenzel, E. M. (1998). The impact of system latency on dynamic performance in virtual acoustic environments. In Proceedings of the 15th International Congress on Acoustics and 135th Meeting of the Acoustical Society of America, 2405-2406. Wenzel, E. M. (1999). Effect of increasing system latency on localization of virtual sounds. In Proceedings of the AES 16th International Conference on Spatial Sound Reproduction, 42-50. Wenzel, E. M. (2001). Effect of increasing system latency on localization of virtual sounds with short and long duration. In Proceedings of 7th International Conference on Auditory Displays (ICAD). 185-190. Woszczyk, W., Bech, S., & Hansen, V. (1995). Interactions between audio-visual factors in a home theater system: Definition of subjective attributes. AES 99th Convention. Preprint 4133. Zielinski, S., Rumsey, F., Bech, S., de Bruyn, B., & Kassier, R. (2003). Computer games and multichannel audio quality—The effect of division of attention between auditory and visual modalities. In Proceedings of the AES 24th International Conference on Multichannel Audio, 85-93. Zwicker, E., & Fastl, H. (1999). Psychoacoustics—Facts and models (2nd ed.). Berlin: Springer Verlag.
KEY TERMS AND DEFINITIONS
Binaural: Literally means “having or relating to two ears”. Binaural hearing, along with frequency cues, lets humans determine the direction of incidence of sounds.
Brodmann Areas: 52 different regions of the cortex, defined on the basis of the organization of cells. Named after Korbinian Brodmann’s maps of cortical areas in humans, published 1909.
Cognitive Load: A term describing the load on working memory during instruction (problem solving, thinking, reasoning).
Dorsal Stream: Also known as the parietal stream, the “where” stream, or the “how” stream, proposed to be involved in the guidance of actions and recognizing where objects are in space.
Fitts’ Law: A model of the human movement in human-computer interaction and ergonomics which predicts that the time required to rapidly move to a target area is a function of the distance to and the size of the target.
Localization: The ability to detect the direction of incidence of a sound.
Monaural: Literally means “having or relating to one ear”.
Multi-Modal: More than one perceptual modality involved, usually the auditory and the visual domain, sometimes also including haptics.
Perceptual Cycle: A model describing human perception as a cyclic setup of schemata, perceptual exploration, and stimulus environment which influence each other in a continuously updated process, first introduced by US psychologist Ulric Neisser.
Presence: The feeling of being present in an artificial environment, for example a game scenario in a jungle.
Quality of Experience: The overall acceptability of an application or service, as perceived subjectively by the end-user. Quality of Experience includes the complete end-to-end system effects (client, terminal, network, services infrastructure and so on). Overall acceptability may be influenced by user expectations and context.
Salience: The state or quality of an item that stands out relative to neighboring items.
Schema: Previous knowledge, something we already understand or are familiar with.
Single-Cell Recording: A technique used in brain research to observe changes in voltage or current in a neuron, thus measuring a neuron’s activity.
Space Invaders: An arcade video game designed by Tomohiro Nishikado, released in 1978, with the aim of defeating waves of aliens with a cannon, earning as many points as possible.
Ventral Stream: Also known as the “what” stream, associated with object recognition and form representation.
Section 3
Emotion & Affect
Chapter 9
Causing Fear, Suspense, and Anxiety Using Sound Design in Computer Games
Paul Toprac
Southern Methodist University, USA
Ahmed Abdel-Meguid
Southern Methodist University, USA
ABSTRACT
This chapter provides a theoretical foundation for the study of how emotions are affected by game sound as well as empirical evidence for determining how to promote fear, suspense, and anxiety in players using sound effects. Four perspectives on emotions are described: Darwinian, James-Lange, cognitive, and social constructivist. Three basic properties of diegetic sound effects were studied: volume, timing, and source. Results strongly suggest that the best sound design for causing fear is high volume and timed sound effects (synchronized game sound with visual moment) and somewhat suggest that sourced sound effects also promote fear. For anxiety, results strongly suggest that the best sound design is medium volume sound effects. Results also suggest that acousmatic and untimed sound effects evoke suspense rather than anxiety. Low volume sound effects are not effective at evoking fear, suspense, and anxiety due to potential masking by other sounds. Implications and future research directions are presented.
DOI: 10.4018/978-1-61692-828-5.ch009

INTRODUCTION
Computer games are audio-visual entertainment media that provide an escapist experience (Grimshaw, 2007). That is, computer games utilize both audio and visual media to capture players’ attention and engage players’ motor and mental skills, thus immersing the players in the gameworld. This immersion provides an escape for players from
everyday life. Immersion occurs when the game: (1) “monopolizes the senses” (Carr, 2006, p. 68), (2) engages the player psychologically, and (3) requires physical action (see Nacke & Grimshaw, 2011 for more on immersion). The authors of this chapter believe that all three components of immersion are highly linked and can be (and are) used to evoke emotions from players. Visuals and sound are often used to elicit specific emotions among the consumers of computer games. Currently, however, the computer game
industry is focused on the quality of the graphics within the game. The computer game industry has clear guidelines for visuals, but not particularly for sound. Yet, sound is at least as important as visuals, if not more important, for creating immersion and evoking emotions (Anderson, 1996; Grimshaw, 2007), though it is often underrated by players (Cunningham, Grout, & Picking, 2011). Sound can change the player’s perception of images to the point where the sound dominates even when the player is presented an opposing relationship between the sound and image (Collins, Tessler, Harrigan, Dixon, & Fugelsang, 2011). Unfortunately, as Collins (2007) states, “work into the sonic aspects of audio-visual media has neglected games [and] video games audio remains largely unexplored” (p. 263). Furthermore, as Serafin (2004) wrote, “[s]o far no quantitative results are available to help designers to build soundscapes which allow the user to feel fully immersed” (p. 4). And, finally, according to Nacke and Grimshaw (2011), “not much work has been put into sensing the emotional cues of game sound in games, let alone in understanding the impact of game sounds on players’ affective responses”. The purpose of the current chapter is to create a theoretical foundation and empirical evidence for the study of how emotions and affect are impacted by game sound. Although Roux-Girard (2011) “firmly believes that adopting a position that emphasizes reception issues of gameplay can provide a more productive model than one that would be grounded directly in the production aspects (implementation and programming) of game audio”, we believe that researching the impact of the production aspects of game sounds is just as productive. Ultimately, we believe that both approaches are equally viable and should be used to understand the experience of game sounds.
Whereas Roux-Girard attempts to understand the effect of game sounds from a top-down approach, our intent is to build from bottom-up a research foundation upon which further inquiry into the relationship between emotions and game sound
can be conducted. Furthermore, our aim is to produce valid results that are able to both explain phenomena and be useful for game designers. Specifically, this chapter describes a study to determine the best sound design principles pertaining to game sound effects (defined here as all diegetic game sound except dialogue) to cause fear and anxiety in players, two common emotions that players feel while playing computer games. The empirical research examines how to manipulate three basic properties of game sound (volume, timing, and source) through a game level designed to evoke fear, suspense, and anxiety. Through this quantitative and qualitative examination, the general design principles of how to develop game sound effects to promote fear and anxiety are better understood.
BACKGROUND: LITERATURE AND FIELD REVIEW
In order to design games and perform research using game sound for promoting fear, suspense, and anxiety, both theories of emotion and the current state of the art design of sound effects in games are important to understand. Emotions and affect are elusive in nature, and difficult to define (Cornelius, 1996). For instance, some consider emotions and affect to be the same psychological construct, while other researchers consider affect to be the conscious experience of emotions. In either case, our research measures the conscious experience of emotion, whether that is considered affect or emotion or both. Furthermore, rather than define emotion and affect, which is attempted in Nacke and Grimshaw (2011), we will describe emotions from the perspectives of four theoretical traditions of research on emotion in psychology (Cornelius, 1996). These schools of thought on the sources and development of emotion are the Darwinian theory of emotions, James-Lange theory, cognitive theory, and social constructivist theory. Our intent is to provide an understanding of the emergence
of emotions while playing a game, which Perron (2005a) termed gameplay emotions. Gameplay emotions are different from everyday emotions. For instance, gameplay emotions can be paradoxical in nature, such as deliberately watching a scary movie to enjoy the sensation of fear (Cunningham et al., 2011). The following discussion on emotions is focused on fear, anxiety, and suspense.
Theories of Emotions and Games

The Darwinian Perspective of Emotions
In the Darwinian perspective of emotions, there are certain basic emotions that are inherited and shared across the human experience. Researchers of the Darwinian perspective, such as Plutchik (1984), have identified several primary emotions, such as rage, loathing, grief, amazement, terror, admiration, ecstasy, and vigilance. Each of these emotions has several levels of intensity. For instance, the less intense levels of the emotion of terror are fear and then apprehension. Game players can be observed showing many of these identified primary emotions. For example, players often feel fear or apprehension at the appearance of the enemy, particularly in survival horror computer games. Plutchik’s theory posits that we can promote fear in everyone. Fear is a psychological experience to prepare individuals for the ‘freeze, fight or flight’ response (Gray, 1971). However, Plutchik’s theory does not easily account for anxiety. Fear is a reaction to a specific danger or threat while anxiety is unspecific, vague, and objectless (May, 1977). Thus, anxiety is not a lower level of intensity of fear, or even apprehension. Anxiety is diffuse with a vague sense of apprehension (Kaplan & Sadock, 1998), rather than apprehension due to a specific stimulus (Gullone, King, & Ollendick, 2000). Anxiety is often thought to be a future-oriented mood, a vague discomforting sense that things will go wrong, which can have an adaptive function of enhancing performance at optimal levels
(Barlow, 1988). May (1977) resolves whether anxiety is innate or not by suggesting that all humans have the instinctive capacity to react to threats, whether the threat is concrete (for fear) or unspecific (anxiety). However, what the individual considers threats may be learned and are triggered by the appraisal of particular events or stimuli.
The Cognitive Perspective of Emotions
The importance of appraisals of particular events or stimuli, and their associations with emotions, is illuminated by the cognitive theory of emotions (Cornelius, 1996). Based on the cognitive perspective, emotions and behavior are constantly changing as an individual appraises and reappraises the changing environment (Folkman & Lazarus, 1990). Depending on what the player of a game is consciously thinking of a situation, he or she can experience any of a range of emotions and behaviors. Appraisals and reappraisals are important parts of the emotional experience of survival horror games (Perron, 2004). Game players may feel fear and anxiety by appraising particular sounds as being “scary” or “creepy” or they may appraise the same game sounds as “silly.” Game designers can promote the experience of fear and anxiety through priming cues, such as music, acousmatic sound effects, and visuals, which can encourage “thinking” about the scariness or creepiness of the game. After being primed, the player is more likely to appraise particular stimuli in the way that is desired by the game designer, such as fear when suddenly a monster appears in survival horror computer games.
The James-Lange Perspective of Emotions
Appraisals are also an important part of the James-Lange perspective of emotions. However, in this perspective the appraisals are unconscious evaluations of the body’s response to stimuli (Cornelius, 1996). While playing games, the body
reacts and provides feedback to the brain, which unconsciously appraises the body’s reaction and influences further cognition. For instance, if the player jumps due to a stimulus, the mind may attribute that bodily action to being scared. Research by Wolfson and Case (2000) showed that louder sounds in a computer game increased heart rate and impacted physiological arousal and attention. According to the James-Lange perspective, the player may unconsciously attribute increased heart rate and arousal to a state of fear and/or anxiety. The game designer can use this knowledge to his or her advantage. Loud and sudden noises, for example, can make players instinctively jump, which promotes the feeling of fear. However, it is less clear how to use this perspective for eliciting anxiety. If a player sweats profusely while playing a game, will he or she attribute that to fear or anxiety or something else? Perhaps the answer depends on the stimuli, which leads some researchers back to the Darwinian and cognitive perspectives of emotions, where Darwinians believe that our emotional reactions to stimuli are innate and cognitivists believe that they are learned.
The Social Constructivist Perspective of Emotions
Are our reactions to particular stimuli innate or learned? If learned, how can game designers know what stimuli to use to promote fear and anxiety? Although there is some controversy regarding the answer to the former question, there do seem to be a few, select stimuli that innately promote fear such as sudden, unexpected movements, especially approach-motions, or sounds (Gebeke, 1993) and, for anxiety, the threatened security patterns between an individual and his or her valued significant persons (May, 1977). However, beyond that, it would seem that our reactions to particular events or stimuli are learned, and this leads to the question of how and what is learned. The response to this is best answered by the social constructivist perspective of emotions.
Fear and anxiety responses to particular stimuli are learned through conditioning from family and other valued persons, which, in turn, are part of the larger general culture (May, 1977). Social constructivists believe that emotions are used to maintain interpersonal relationships and identity in a person’s communities (Greeno, Collins, & Resnick, 1996). The community can be friends, relatives, or other game players, who are all influenced by the general culture. For instance, people often feel scared when they suddenly see a cockroach because that is what they have learned from their mothers, who feared and loathed cockroaches. If they did not feel fear, but rather liked cockroaches, their relationship with their mothers may have become strained, which, at a young age, would not be desirable. Thus, these youths appropriate the feeling of fear of cockroaches from their mothers, who, in turn, maintain this fear because it is part of the cultural milieu in which the mothers desired to participate. Finally, as Cunningham, Grout, and Picking (2011) point out, the social and cultural milieu includes the context in which the person is experiencing his or her emotion. Thus, emotions that emerge while playing games (that is, gameplay emotions) are different from everyday emotions because the context of playing games is not the same as the context of typical everyday experiences. For game designers, this means that anything that the player has been taught to fear can be leveraged to promote fear, including such things as death and failure, which are risks (to the player’s avatar) in most computer games. Furthermore, game designers can use sound effects as cues for threats. For example, if a player gets close to an electrical hazard, the game designer can add a loud sparking noise to scare the player. However, game designers must keep in mind that particular graphics and sound may elicit different or less intense emotions between individuals in different cultures.
For instance, the sound of a slide-action shotgun pumping may promote more fear or
anxiety in America than in countries where there are fewer shotguns. Though “response to sound, therefore, can vary from player to player” (Collins et al., 2011), the theories of emotion described in this section provide a framework to understand the sources and range of emotional responses of players to game sound. In the Darwinian perspective, there are certain basic emotions that are inherited and shared across the human experience. The cognitive perspective on emotions contributes to understanding them by explicating the importance of appraisals of stimuli and their underlying association with emotions. Researchers of the James-Lange theory of emotions believe that humans first experience bodily changes as a result of the perception of the emotion-eliciting stimuli and that this experience is the feeling. Furthermore, the social constructivist perspective posits the theory that emotions are learned and culturally determined. These emotion theories correspond to three generally accepted forms of human expression of emotions: expressive behavior (showing the emotion), subjective experience (appraising the feeling), and the physiological component (sympathetic arousal) (see Cunningham et al., 2011). As previously mentioned, anxiety, at first glance, could be conceptualized as a less intense experience of fear, but this is not considered by most emotion researchers to be the case. Fear and anxiety are closely related but not the same (Gullone, King, & Ollendick, 2000). Fear is an emotional response to a particular event or object and anxiety is an emotional response to an unspecific event or object. Though fear and anxiety are considered as two separate mental processes, representing different affective and cognitive states, the two are considered linked. Fear can feed off anxiety and vice-versa. Game designers may be able to increase players’ fear if the players are already anxious rather than in a state of calm.
Because the player is already in a state of nervousness and worry, he or she may perceive
a threat to be more dangerous than warranted, resulting in an elevated fear response. So, what is the emotion of suspense? Compared to fear and anxiety, there has been very little research on suspense. However, psychologists quoted in Paradox of Suspense (Carroll, 1996) provide this definition of suspense: “…a Fear emotion coupled with the cognitive state of uncertainty” (p. 78). That is, fear coupled with anxiety. The film scholar, Zillman (1991), describes suspense as “the experience of uncertainty regarding the outcome of a potentially hostile confrontation” (p. 283), which is similar to the definition of anxiety but with more emphasis on specific stimuli that are associated with fear. Thus, we conclude that suspense is the intersection or overlap of fear and anxiety. Suspense can be viewed as fear of imminent threat that is likely to occur, but has not appeared, and/or a state of high anxiety due to an impending dangerous situation. As Krzywinska (2002), a professor in film studies, states: “Many video games deploy sound as a key sign of impending danger, designed to agitate a tingling sense in anticipation of the need to act” (p. 213). Fear, anxiety, and suspense are gameplay emotions that are intentionally promoted in the design of survival horror games. Game designers control all that the player sees and hears within the survival horror game experience, and they have used this control to develop sound design techniques to elevate the player’s fear, suspense, and anxiety. Some of these techniques are explained in the following section.
Sound Design in Games
Sound design is used in almost all computer games. To design the soundscape in a computer game, there are a large number of sound properties that can be manipulated (see Liljedahl, 2011; Wilhelmsson & Wallén, 2011). In this chapter, sound properties are reduced to three independent variables: volume, timing, and source. We believe that these are three of the most basic properties
of sound that are considered when designing the soundscape of a computer game. Volume is the relative loudness at which a sound is heard from a loudspeaker. Timing is the relative synchronization of the sound with its source. Source is the origin of a sound. Game sounds are here categorized into the common typology of three separate categories: music, dialogue, and sound effects (Wilhelmsson & Wallén, 2011). Music is a type of mood-setting technique that typically coincides with the theme of a game. Music is not considered as part of this study because it is typically non-diegetic in nature and music is “heavily controlled by tempo” (Cunningham et al., 2011) rather than the sound properties studied here. Dialogue is diegetic but is not considered in this chapter because properties of speech, such as intonation, may be more important than the selected properties of volume, timing, and source in their impact on players. Sound effects are diegetic game sounds, such as ambient, weapon, and environmental sounds. Examples of sound effects are: ambient noises such as rustling leaves and the steady drip of rain; player avatar sounds that are not related to dialogue, such as pained grunts; and weapon noises, such as the crack of a rifle or the swing of a club. To understand how the sound properties of sound effects are used by game designers to cause fear and anxiety, the state of the art in sound design for games is reviewed in this section. Specifically, computer games in the survival horror genre are reviewed using the sound properties of volume, timing, and source. Survival horror games provide good case studies because they are designed to keep the player in a state of fear, suspense, and anxiety throughout the game: “Crawling with monsters, survival horror games make wonderful use of surprise, attack, appearances, and any other disturbing action that happens without warning” (Perron, 2004, p. 2).
The games chosen for this field review are: Alone in the Dark (2008), Dead Space (2008), Doom 3 (2004), Eternal Darkness (2002), and Silent Hill 2 (2001). The following
provides an overview of the sound design of the five different survival horror games selected:
• Alone in the Dark: This game uses high quality visuals coupled with interspersed moments of surprise to cause player fear and anxiety.
• Dead Space: This game has an abundance of ambient sound effects and clutter to add to the realism and increase the player’s anxiety and suspense. Dead Space also uses a combination of well-timed and high-volume sound effects to elicit fear responses from the player, and has received praise for its sound design.
• Doom 3: The soundscape of Doom 3 focuses on voice acting and ambient sound effects. The ambience succeeds in creating a mood of suspense, while the encounters with monsters focus on creating fear.
• Eternal Darkness: This game takes a minimalistic approach to sound design and only uses sounds very sparingly. This approach allows the player to hear what few sounds are in the game with little difficulty and this increases the effect of each sound.
• Silent Hill 2: This is an older game that uses a minimalist approach to sound design, like Eternal Darkness, but with more of an emphasis on game sounds without visible sources to create suspense.
Volume of Sound Effects
While game designers decide on what level of volume to play their sound effects relative to other sounds, players change the overall volume of the game sounds emitting from their loudspeakers at will. Thus, the game designer can only manipulate the magnitude of volume in relation to other sounds. A “loud” sound has a higher volume than the average sounds currently playing. A “soft” sound has a lower volume than the average. Psychoacoustic research suggests that the lower
in volume a sound effect is, the more likely that players will miss the sound (Healy, Proctor, & Weiner, 2003). For game designers, this means they should create important sound effects that are at least the same volume, if not louder, than the ambient soundscape in order to be perceived. Loud sound effects are more likely to be effective at evoking sudden and shocking emotions in the player than soft sounds. Softer sounds, however, may serve as a good atmospheric tool that can enhance immersion and set a mood. Computer games typically play ambient sound effects at low to medium volume, while emotion-evoking sound effects play at medium to high volumes to maximize the likelihood that the player perceives them. For instance, in Alone in the Dark, the ambient sound effects such as electrical sparks and raging fires are typically abrupt and louder than the background music but soft enough that they do not drown out other important sounds like dialogue and combat sound effects. The ambient sound effects of Dead Space consist of steam vents leaking, garbage rustling, and lights sparking, and are all low to medium volume. In these cases, it is not clear whether the ambient sounds solely promote greater immersion and/or promote anxiety. In Dead Space, the player’s interaction with the monsters is the most important part of the game, and thus, the sound effects related to these interactions are the loudest. For instance, any time the player engages in combat, the monsters screech loudly and very discernibly until they die. The scream instantly tells the player that his or her avatar is in danger and, because the monsters can kill the player’s avatar, these sound effects can cause fear in the player. All of the ambient sounds and music in Doom 3 are loud: they mask out almost all other game sound. The enemy sounds are quieter than the ambient noise, which causes the enemies to seem less menacing. 
The only thing consistently louder than the music and the ambient noise is the player’s gun and avatar’s pain screams. Doom 3
seems to focus more on visual quality than sound quality. Yet, one section in Doom 3 that stands out among the rest occurs when a screaming, flying skull circles around the player’s avatar, and its volume rises and falls based on its distance from that avatar. This part of the game causes fear due to the perceived danger from the sudden and loud sounds accompanied by the mysterious nature of the flying skulls, though over time the fear subsides as the player becomes more habituated to the situation. Furthermore, Doom 3 has an enemy ambush almost every time the player’s avatar picks up an item. The game has the same sound and the same enemy for many of these ambushes. This eventually becomes repetitive and boring, and players begin to anticipate the ambushes. Eternal Darkness and Silent Hill 2 use the technique of “less is more,” where a few high volume sound effects scattered throughout evoke more fear and anxiety than many high volume recurring sound effects that may eventually cause habituation, as in Doom 3. The use of volume with sound effects varies depending on whether the game designer is attempting to evoke fear or anxiety. Based on the above review of games, high volume, abrupt sound effects seem to be more effective at causing fear, while low to medium volume ambient sound effects may be more effective at creating suspense and anxiety by convincing the player that they are in a dangerous circumstance.
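The notion of relative volume used in this section can be made concrete with a small sketch. The Python below is illustrative only and is not taken from any of the games reviewed. It assumes that level is measured as RMS in decibels (which is only a rough stand-in for perceived loudness), that a hypothetical 3 dB margin separates “loud” and “soft” from “average,” and that distance-based volume changes, such as those of the circling skull in Doom 3, follow a simple inverse-distance rule. All function names and parameters are invented for this example.

```python
import math


def rms_db(samples):
    """Return the RMS level of a block of samples in dB relative to full scale (1.0)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    # Clamp to a tiny floor so that digital silence does not produce log10(0).
    return 10.0 * math.log10(max(mean_square, 1e-12))


def relative_loudness(effect_db, ambient_db, margin_db=3.0):
    """Label an effect relative to the ambient mix.

    margin_db is a hypothetical tolerance chosen for illustration,
    not a value reported in the chapter.
    """
    difference = effect_db - ambient_db
    if difference > margin_db:
        return "loud"
    if difference < -margin_db:
        return "soft"
    return "average"


def distance_gain(distance, reference_distance=1.0):
    """Inverse-distance gain: the amplitude halves each time the distance doubles.

    Sources closer than reference_distance are clamped to full gain (1.0).
    """
    return reference_distance / max(distance, reference_distance)
```

Under these assumptions, an effect mixed 6 dB above the ambient soundscape would be labelled “loud,” which corresponds to the kind of sound effect this section associates with fear rather than with atmosphere.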
Timing of Sound Effects
Game designers choose one of three timing alternatives for each sound effect: (1) the sound effect is timed to coincide with a corresponding, often visible, event or object, (2) the sound effect and the corresponding event or object lag each other, or (3) the sound effect is played without regard to any specific corresponding event, that is, untimed. Thus, timing can be conceptualized as the degree of synchronization between the sound effect and visible object(s) (see Roux-Girard, 2011). For
instance, when an enemy ambush in a game begins, an appropriate sound effect accompanies the ambush (the game sound is highly synchronized with the visible event), such as a door swinging open or glass shattering. The intended purpose of these synchronized sound effects can be to surprise or startle the player, which can promote fear. While many survival horror games use ambushes regularly, such as the majority of encounters in Dead Space and Doom 3, Silent Hill 2 seldom uses ambushes to scare the player. Rather, this game does the opposite by playing a radio static sound loop, emanating from the avatar’s pocket radio, to warn players of nearby enemies. The player quickly learns that the white noise is a forewarning of an imminent attack. This lagging technique, coupled with the extremely limited visibility in the game, causes players to search for the source of the static whenever they hear it. Players know that a dangerous situation is nearby, which often causes players to feel suspense. This forewarning is an emotional and cognitive cue for problem solving (Perron, 2004). Untimed environmental sound effects are present in Alone in the Dark and Silent Hill 2. In these games, sounds such as crackling fire, whistling wind, or shaking earth are seemingly set to play at random.
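The three timing alternatives described in this section can be sketched as a simple classification of the offset between a sound and its corresponding event. The Python below is a minimal illustration under stated assumptions: onsets are given in seconds, a missing event is represented by None, and the 0.1 second synchronization window is a hypothetical threshold that does not come from the chapter or from any game engine.

```python
def classify_timing(sound_onset, event_onset, sync_window=0.1):
    """Classify a sound effect as 'synchronized', 'lagging', or 'untimed'.

    sound_onset and event_onset are times in seconds. event_onset is None
    when the sound has no corresponding event. sync_window is a hypothetical
    tolerance, chosen only for illustration.
    """
    if event_onset is None:
        # No corresponding event at all: the sound plays without regard to events.
        return "untimed"
    offset = sound_onset - event_onset
    if abs(offset) <= sync_window:
        return "synchronized"
    # The sound either precedes the event (a forewarning, as with the radio
    # static in Silent Hill 2) or follows it.
    return "lagging"
```

In this sketch, a door-slam sound that starts within a tenth of a second of an ambush counts as synchronized, radio static that begins several seconds before an enemy appears counts as lagging, and a randomly triggered gust of wind counts as untimed.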
Source of Sound Effects
According to psychoacoustic theories, humans judge whether a sound comes from an appropriate source by the visible availability of a source and whether or not that source could sensibly create that sound (Healy, Proctor, & Weiner, 2003). Almost all games have clearly visible sources for their sound effects, such as the sound of an attacking enemy. Providing a visible source of sound helps the player determine what to do within the game, helps the player navigate through the game by listening (Grimshaw, 2008), and enhances the survival prospects of the player’s avatar (Roux-Girard, 2011). For instance, the player can listen for the location of monsters. Furthermore, ambient game
sounds help to immerse the player by bridging the reality gap between the game and real physical environments (Liljedahl, 2011). Some examples of ambient sound effects are leaking ventilation shafts in Alone in the Dark, sparking electrical wires in Dead Space, and rustling leaves in Eternal Darkness, all of which are visible to the player when played. These ambient sound effects also help set the mood of the game for players, which may encourage players to appraise objects and events in the game as scary. In Silent Hill 2, however, most ambient sound effects have no visible source. Players of Silent Hill 2 are unable to find the source of these acousmatic sounds, such as babies crying, discordant wind chimes clanging together, and tricycle bells ringing. Furthermore, monsters’ sounds are mixed at low volume within the nondiegetic music (Roux-Girard, 2011). These sound production techniques add a strong air of mystery and ambiguity between sound generators (Roux-Girard, 2011), which may cause anxiety. Based on the literature and field review, our experimental hypothesis for designing sound effects for fear and anxiety is as follows: sound effects that are high in volume, synchronized with the corresponding visual stimulus, and visibly sourced are more effective at creating fear, whereas low to medium volume, scary, eerie, or mysterious acousmatic sound effects are more effective at creating anxiety, with no difference between timed and untimed sound effects for anxiety. The authors believe that if the source of the sound effect can be seen by the player, then the synchresis of timing with the visible threat becomes salient, promoting veridicality (Collins et al., 2011; Roux-Girard, 2011) and resulting in the player feeling fear. If the source cannot be seen, that is, if the sound is acousmatic, then synchresis is not achieved and the player cannot determine the relationship between what the player sees and what the player hears: this should promote anxiety.
If the sound effect is not timed with the visible threat, then the sound effect will probably be ignored and will not promote
fear or anxiety. Furthermore, timed sound effects (synchronized game sound and visual threat) will not promote anxiety because the threat becomes a clear stimulus. Thus, timing should not promote anxiety. Fear and anxiety, rather than suspense, were measured in the experiment discussed below because fear and anxiety are considered separate emotions, whereas suspense is the overlap of these emotions, which could confound the interpretation of the results. Our hypothesis was tested using the methodology described in the next section.
EXPERIMENTAL EVIDENCE
The hypothesis, as stated in the previous section, was tested using a survival horror game level in Gears of War (2007) that was created in Unreal Editor 3. During each play-through, the participant heard one randomly selected alternative (wolf howl, gunfire, or wretch growl) for the volume test, one randomly selected alternative (thunder, boomer growl, or creaking door) for the timing test, and one randomly selected alternative (locust growl, glass shattering, or footsteps) for the source test. Both quantitative data, using 7-point self-report surveys, and qualitative data were gathered and analyzed. Although there can be issues with “after-the-fact narration” (Nacke & Grimshaw, 2011) by participants completing self-report surveys and interviews, the use of these indirect measures is a common approach to data gathering in research on emotions. Thirty-four participants in the U.S.A. (ten females and twenty-four males) took part in the study. The average age of the participants was 26 years. The average playing time per week was about eleven hours, and fifteen participants (approximately 44%) liked playing survival horror games. (For a full exposition of the methodology and results, see Amdel-Meguid, 2009).
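The random selection of one alternative per test, as described above, can be sketched as follows. This Python is a hypothetical illustration, not the procedure actually used in the study: the alternatives are copied from the description above, but the dictionary layout, the function name, and the use of a seeded random generator are assumptions made for this example.

```python
import random

# Alternatives for each test, as listed in the description of the experiment.
ALTERNATIVES = {
    "volume": ["wolf howl", "gunfire", "wretch growl"],
    "timing": ["thunder", "boomer growl", "creaking door"],
    "source": ["locust growl", "glass shattering", "footsteps"],
}


def assign_alternatives(rng):
    """Pick one alternative per test for a single play-through.

    rng is a random.Random instance; passing a seeded generator makes the
    assignment reproducible (an assumption of this sketch, not a detail
    reported for the study).
    """
    return {test: rng.choice(options) for test, options in ALTERNATIVES.items()}


if __name__ == "__main__":
    # One hypothetical participant; the seed is arbitrary.
    print(assign_alternatives(random.Random(2009)))
```

Each call returns exactly one alternative for the volume test, one for the timing test, and one for the source test.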
Causing Fear Findings
Results showed a statistically significant (p < 0.05) and large (η² > 0.14) difference in fear due to the volume of sound effects between low volume sound and high volume sound, as well as between medium volume sound and high volume sound. No meaningful qualitative data was gathered for volume-related fear responses. For timing, results showed a statistically significant and very large (see eta squared, Cohen, 1988) increase in fear for timed compared with untimed sound effects. Qualitative data showed that timed sound effects enhanced the fear of many players when accompanied by a visual gameplay element, such as the presence of an in-game enemy, though the sound effect by itself may not have substantially elicited fear. Findings for fear related to sourced sound effects appeared to be considerable, but they were not statistically significant. Several participants verbally reported that the acousmatic sound effects, such as a breaking window or footsteps on the ceiling, evoked fear. In particular, participants reported that the acousmatic sound effect required them to be attentive to possible threats.
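The effect size reported above, eta squared, is the proportion of the total variance in the ratings that is explained by group membership, that is, the between-groups sum of squares divided by the total sum of squares. The Python below is a minimal sketch of that calculation for a one-way design. It does not reproduce the study's analysis or data, and the function names are invented here. The labels use the benchmarks commonly attributed to Cohen (1988), roughly 0.01 for small, 0.06 for medium, and 0.14 for large, treated here as lower bounds.

```python
def eta_squared(groups):
    """Return eta squared for a one-way design.

    groups is a list of lists, one list of ratings per condition.
    Eta squared = SS_between / SS_total.
    """
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((value - grand_mean) ** 2 for value in all_values)
    ss_between = sum(
        len(group) * ((sum(group) / len(group)) - grand_mean) ** 2
        for group in groups
    )
    return ss_between / ss_total


def effect_size_label(eta2):
    """Label eta squared using the commonly cited Cohen (1988) benchmarks."""
    if eta2 >= 0.14:
        return "large"
    if eta2 >= 0.06:
        return "medium"
    if eta2 >= 0.01:
        return "small"
    return "negligible"
```

With real data, each inner list would hold the 7-point fear ratings for one condition (for example, low, medium, and high volume); any values used to try the functions should be treated as synthetic.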
Causing Anxiety Findings
Results showed a statistically significant and large difference in anxiety due to the volume of sound effects between low volume sound and medium volume sound, as well as between low volume sound and high volume sound. No meaningful qualitative data was gathered for volume related anxiety responses. There was not a statistically significant difference between timed and untimed sound effects for anxiety. Qualitative data showed that untimed sound effects caught some players off guard, because they could not determine whether the sounds were meant to signal danger, or if they were benign sounds.
There was not a statistically significant difference between sourced and acousmatic sound effects. Some participants reported that they felt anxiety when they could not find the source of a sound effect. Other participants stopped and looked around when an acousmatic sound played, as they did for untimed sound effects, and proceeded to search for the source of the sound effect. One player stated that not finding the source of an untimed sound effect made him worry that he had possibly missed something important.
Discussion
Causing Fear Discussion
Results strongly suggest that high volume sound effects are most effective at causing fear in players. The quantitative data showed a significant, large increase in reported fear responses when a sound effect is louder than the rest of the sounds in the game. For game designers, this implies that the louder they make a sound effect relative to other sounds, the more effective it is at promoting fear in a player. In addition, results strongly suggest that sound effects timed to coincide with a visual gameplay element, such as an in-game enemy, are effective at eliciting fear. Quantitative data showed a significant, large increase in fear responses due to timed sound effects compared to untimed sound effects. However, the qualitative data showed that players may not have reacted with fear to the sound itself; rather, fear was primarily evoked by the accompanying visual gameplay element. That is, a well-timed sound effect amplifies attention to the gameplay element and enhances the initial fear response caused by the visual perception of that element. The synchronization of the sound and corresponding image enhanced the feeling of fear through the process of synchresis, which promoted veridicality. For game designers, this implies that accompanying a visual gameplay element with a well-timed, appropriate sound effect is more effective at causing fear in players than introducing the gameplay element without a sound effect or with a mistimed sound effect. Finally, results were mixed on whether sourced sounds elicit more fear than acousmatic sound effects. Quantitative data did not show a significant increase in fear responses to sourced sounds compared to acousmatic sound effects. However, qualitative data suggested that an acousmatic sound effect drew attention to a potentially imminent danger in the game, which may have put players in a state of suspense. If this is the case, some participants may not have reported that sourced sound effects evoked significantly more fear than acousmatic sounds because they considered their suspense responses to be closer to fear than anxiety.
Causing Anxiety Discussion
Results showed that medium and high volume sound effects are significantly and substantially more effective at eliciting anxiety in players than low volume sound effects. Perhaps low volume sound effects are not easily perceived because they become masked amidst other higher volume sounds. In contrast, high volume sound effects are easily perceived but not necessary, because there is no significant difference between medium and high volume sound effects for evoking anxiety. Furthermore, given that high volume sound seems to elicit fear reactions from players, the use of this sound technique for evoking anxiety should be avoided because of its potential confounding effect. For game designers, this implies that anxiety-causing sound effects are best played at the same volume as the average soundscape in the game. Low volume sound effects, perhaps, should be used to immerse the player by generating the ambience and mood of the game (Roux-Girard, 2011) rather than to promote specific emotions. The quantitative results did not show a significant change in anxiety between timed and untimed sound effects. Qualitative results indicated that
some players were concerned by the untimed sounds. However, this concern in finding the nature of the untimed sound effect seems to be more about not knowing the source of the sound effect rather than the timing. Likewise, the quantitative results did not show a significant difference between sourced and acousmatic sound effects in evoking anxiety. Nevertheless, the qualitative results indicate that when some players heard an acousmatic sound, they stopped and looked around in an attempt to find the source. Being unable to find the source caused some players to become concerned that something dangerous would occur later in the game, which could mean that the players were in a state of suspense. And, as noted previously, players may have considered suspense to be more of a fear emotion than anxiety, which would have resulted in less reporting of anxiety. The overall implication to game designers is that playing a threatening or eerie sound effect without a visual source may be better at causing suspense in players than accompanying a sound with a visual threat. Furthermore, untimed sound effects can also promote suspense if the player perceives it as independent from the visual source, at which point the player would appraise the sound effect as acousmatic.
CONCLUSION
The aim of the current chapter was to provide a theoretical foundation for the study of evoking emotions using sound design and determine how to cause fear and anxiety through sound design in computer games. The literature and field review that focused on human emotion theory and survival horror games provided an understanding of basic sound design principles of volume, timing, and source in relation to the emotions of fear and anxiety. This study used qualitative and quantitative methods to determine the best use of volume, timing, and source of diegetic sound effects to cause fear and anxiety in players.
The results of this study strongly suggest that the best sound design for causing fear is high volume sound effects that are well-timed with the accompanying visual element. This may seem obvious, but this study has provided statistical validity for using this technique, and these results can be used as a basis for further research. For anxiety, results strongly suggest that the best sound design is medium volume sound effects. Furthermore, qualitative data suggest that suspense was evoked by untimed and acousmatic sound effects. And, although results suggested that medium volume sound effects were able to promote anxiety, players may have been in a state of suspense at this time, as well. Low volume acousmatic sound effects appear not to be effective at evoking fear and anxiety, and possibly any emotion, due to their tendency to become masked by other sounds. Perhaps low volume sound effects may be best used for enhancing immersion or mood. An interesting interpretation of the current study’s evidence is that anxiety, as a separate gameplay emotion, is difficult to evoke on its own. Rather, the combination of fear and anxiety, that is, suspense, is easier to promote, and probably more desirable. Players play survival horror games to experience fear and suspense (Perron, 2005b). Anxiety is too diffuse and vague to be compelling for players to experience in survival horror games. Players of these games would rather have a more direct and powerful emotional response to perceived events and gameplay. This chapter provided quantitative and qualitative evidence that game designers can manipulate the sound properties of volume, timing, and source to evoke fear, suspense, and anxiety in players. The literature and field review, methodology, and results of this study can serve as a foundation for future research.
FUTURE RESEARCH DIRECTIONS
One of the limitations of this study is that most participants were male and tended to be heavy gamers. Future studies may want to focus on populations that better represent female and/or casual gamers. Another limitation is that all the sound effects were included in one level, so there was possible interaction between the volume, timing, and source variables. For instance, the volume of a sound effect may influence the emotional impact of timing and/or source. In order to control for these possible confounding factors, a future study may focus on the effect of a single sound property. Furthermore, as in most studies, a larger sample should yield better external validity. One possible research direction is to continue studying fear, suspense, and anxiety beyond the parameters of the study described in this chapter. For instance, are the sound design techniques the same for causing fear, suspense, and anxiety in game genres other than survival horror? What is the difference between male and female responses to sound design techniques that cause fear, suspense, and anxiety? What are the effects of the absence of sound on fear, suspense, and anxiety? What is the relationship between visual gameplay elements and sound effects on how they affect players’ fear, suspense, and anxiety? How can other types of sounds, such as music and dialogue, increase fear, suspense, and anxiety? How do other sound properties affect fear, suspense, and anxiety? Finally, future research can study other emotions using the same or other sound properties. For example, how can game designers elicit the emotions of anger, joy, and sadness in players through sound design? This research would lead to inquiry into the questions raised previously, such as the effect of genre, visual gameplay elements, and the type of sound evoking the studied emotion.
From this research, we would not only understand how to promote certain emotional experiences from playing computer games through the use of sound design, but we may also be able to add new insights and dimensions to emotional theories.
REFERENCES
Alone in the Dark. (2008). Eden Games.
Amdel-Meguid, A. A. (2009). Causing fear and anxiety through sound design in video games. Unpublished master’s thesis. Southern Methodist University, Dallas, Texas, USA.
Anderson, J. D. (1996). The reality of illusion: An ecological approach to cognitive film theory. Carbondale, IL: Southern Illinois University Press.
Barlow, D. H. (1988). Anxiety and its disorders: The nature and treatment of anxiety and panic. New York: Guilford Press.
Carr, D. (2006). Space, navigation and affect. In Carr, C., Buckingham, D., Burn, A., & Schott, G. (Eds.), Computer games: Text, narrative and play (pp. 59–71). Cambridge, UK: Polity.
Carroll, N. (1996). The paradox of suspense. In Vorderer & Friedrichsen (Eds.), Suspense: conceptualization, theoretical analysis, and empirical explorations (pp. 71–90). Hillsdale, NJ: Lawrence Erlbaum Associates.
Cohen, J. W. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Collins, K. (2007). An introduction to the participatory and non-linear aspects of video games audio. In Hawkins, S., & Richardson, J. (Eds.), Essays on sound and vision (pp. 263–298). Helsinki, Finland: Helsinki University Press.
Collins, K., Tessler, H., Harrigan, K., Dixon, M. J., & Fugelsang, J. (2011). Sound in electronic gambling machines: A review of the literature and its relevance to game audio. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Cornelius, R. R. (1996). The science of emotion. Upper Saddle River, NJ: Prentice-Hall.
Creswell, J. (2005). Educational research: Planning, conducting, and evaluating quantitative and qualitative research (2nd ed.). Upper Saddle River, NJ: Pearson Education.
Cunningham, S., Grout, V., & Picking, R. (2011). Emotion, content and context in sound and music. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Dead Space. (2008). Electronic Arts.
Doom 3. (2004). Activision.
Eternal Darkness. (2002). Nintendo.
Folkman, S., & Lazarus, R. S. (1990). Coping and emotion. In Leventhal, N. B., & Trabasso, T. (Eds.), Psychological and biological approaches to emotion (pp. 313–332). Hillsdale, NJ: Erlbaum.
Gears of War. (2007). Microsoft.
Gebeke, D. (1993). Children and fear. Retrieved December 10, 2009, from http://www.ag.ndsu.edu/pubs/yf/famsci/he458w.htm.
Gray, J. A. (1971). The psychology of fear and stress. New York: McGraw-Hill.
Greeno, J. G., Collins, A. M., & Resnick, L. B. (1996). Cognition and learning. In Berliner, D., & Calfee, R. (Eds.), Handbook of educational psychology (pp. 15–46). New York: Simon & Schuster Macmillan.
Grimshaw, M. (2007). Sound and immersion in the first-person shooter. In Proceedings of The 11th International Computer Games Conference: AI, Animation, Mobile, Educational & Serious Games (CGAMES 2007).
Grimshaw, M. (2008). The acoustic ecology of the first-person shooter: The player experience of sound in the first-person shooter computer game. Saarbrucken, Germany: VDM Verlag.
Gullone, E., King, N., & Ollendick, T. (2000). The development and psychometric evaluation of the Fear Experiences Questionnaire: An attempt to disentangle the fear and anxiety constructs. Clinical Psychology & Psychotherapy, 7(1), 61–75. doi:10.1002/(SICI)1099-0879(200002)7:1<61::AID-CPP227>3.0.CO;2-P
Healy, A. F., Proctor, R. W., & Weiner, I. B. (2004). Handbook of psychology: Vol. 4. Experimental psychology. Hoboken, NJ: Wiley.
Kaplan, H. I., & Sadock, B. J. (1998). Synopsis of psychiatry. Baltimore, MD: Williams & Wilkins.
Krzywinska, T. (2002). Hands-on horror. In King, G., & Krzywinska, T. (Eds.), ScreenPlay: Cinema/Videogames/Interfaces (pp. 206–223). London: Wallflower.
Liljedahl, M. (2011). Sound for fantasy and freedom. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Lincoln, Y. S., & Guba, E. D. (1985). Naturalistic inquiry. Thousand Oaks, CA: Sage Publications, Inc.
May, R. (1977). The meaning of anxiety (revised ed.). New York: Norton.
Nacke, L., & Grimshaw, M. (2011). Player-game interaction through affective sound. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Perron, B. (2004). Sign of a threat: The effects of warning systems in survival horror games. In Proceedings of the Fourth International COSIGN (Computational Semiotics for Games and New Media) 2004 Conference.
Perron, B. (2005a). A cognitive psychological approach to gameplay emotions. In Proceedings of the Second International DiGRA (Digital Games Research Association) 2005 Conference.
Perron, B. (2005b). Coming to play at frightening yourself: Welcome to the world of horror video games. In Proceeding of the Aesthetics of Play conference.
Plutchik, R. (1984). Emotions: A general psychoevolutionary theory. Hillsdale, NJ: Erlbaum.
Roux-Girard, G. (2011). Listening to fear: A study of sound in horror computer games. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Serafin, S. (2004). Sound design to enhance presence in photorealistic virtual reality. In Proceedings of the 2004 International Conference on Auditory Display.
Silent Hill 2. (2001). Konami.
Wilhelmsson, U., & Wallén, J. (2011). A combined model for the structuring of game audio. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Wolfson, S., & Case, G. (2000). The effects of sound and colour on responses to a computer game. Interacting with Computers, 13, 183–192. doi:10.1016/S0953-5438(00)00037-0
Zillman, D. (1991). The logic of suspense and mystery. In Bryant, J., & Zillman, D. (Eds.), Responding to the screen: Reception and reaction processes (pp. 281–303). Hillsdale, NJ: Lawrence Erlbaum Associates.
ADDITIONAL READINGS
Barbara, S. C. (2003). Hearing in three dimensions. The Journal of the Acoustical Society of America, 113(4), 2200–2200.
Bridgett, R. (2008). Post-production sound: A new production model for interactive media. Soundtrack, 1(1), 29–39. doi:10.1386/st.1.1.29_1
Calvert, S. L., & Scott, M. C. (1989). Sound effects for children’s temporal integration of fast-paced television content. Journal of Broadcasting & Electronic Media, 33(3), 233–246.
Gärdenfors, D. (2003). Designing sound-based computer games. Computer Creativity, 14(2), 111–114.
Grimshaw, M. (2009). The audio uncanny valley: Sound, fear and the horror game. In Proceedings of the Audio Mostly 2009 Conference.
Houlihan, K. (2003). Sound design: The expressive power of music, voice, and sound effects in cinema. Journal of Media Practice, 4(1), 69–69. doi:10.1386/jmpr.4.1.69/0
Izard, C. E. (2009). Emotion theory and research: Highlights, unanswered questions, and emerging issues. Annual Review of Psychology, 60(1), 1–25. doi:10.1146/annurev.psych.60.110707.163539
Jennett, C., & Cox, A. L. (2008). Measuring and defining the experience of immersion in games. International Journal of Human-Computer Studies, 66(9), 641–661. doi:10.1016/j.ijhcs.2008.04.004
Jones, K. (2004). Fear of emotions. Simulation & Gaming, 35(4), 454–460. doi:10.1177/1046878104269893
Jørgensen, K. (2007). On transdiegetic sounds in computer games. Northern Lights: Film & Media Studies Yearbook, 5(1), 105–117. doi:10.1386/nl.5.1.105_1
Klimmt, C., Rizzo, A., Vorderer, P., Koch, J., & Fischer, T. (2009). Experimental evidence for suspense as determinant of video game enjoyment. Cyberpsychology & Behavior, 12(1), 29–31. doi:10.1089/cpb.2008.0060
Kofler, A. (1997). Fear and anxiety across continents: The European and the American way. Innovation: The European Journal of Social Sciences, 10(4), 381–404.
Kyosik, K., & Hyungtai, C. (2008). Enhancement of a 3D sound using psychoacoustics. International Journal of Biological & Medical Sciences, 1(3), 151–155.
Levitt, H. (1971). Transformed up-down methods in psychoacoustics. The Journal of the Acoustical Society of America, 40, 467–477. doi:10.1121/1.1912375
Liu, M., Toprac, P., & Yuen, T. (2008). What factors make a multimedia learning environment engaging: A case study. In Zheng, R. (Ed.), Cognitive Effects of Multimedia Learning. Hershey, PA: Idea Group Inc.
Portnoy, S. (1997). Unmasking sound: Music and representation in The Shout and Blue. Spectator: The University of Southern California Journal of Film & Television, 17(2), 50–59.
Raghuvanshi, N. (2007). Real-time sound synthesis and propagation for games. Communications of the ACM, 50(7), 66–73. doi:10.1145/1272516.1272541
Roberts, J. R. (2006). Influence of sound and vibration from sports impacts on players’ perceptions of equipment quality. Journal of Materials: Design & Applications, 220(4), 215–227.
Robertson, H. (2004). Random noises. Videomaker, 19(4), 71–74.
Satoru, O., & Shigeru, A. (2003). Video game apparatus, background sound output setting method in video game, and computer-readable recording medium storing background sound output setting program. The Journal of the Acoustical Society of America, 114(3), 1208–1208.
Schafer, R. M. (1994). The soundscape: Our sonic environment and the tuning of the world. Rochester, VT: Destiny Books.
Sider, L. (2003). If you wish to see, listen: The role of sound design. Journal of Media Practice, 4(1), 5–15. doi:10.1386/jmpr.4.1.5/0
Tinwell, A., & Grimshaw, M. (2009, April). Survival horror games: An uncanny modality. Proceedings of the International Conference Thinking After Dark.
Yantas, A. E., & Azcan, O. (2006). The effects of the sound-image relationship within sound education for interactive media design. Computer Creativity, 17(2), 91–99.
Zwicker, E., & Fastl, H. (1990). Psychoacoustics facts and models. New York: Springer-Verlag.
KEY TERMS AND DEFINITIONS
Anxiety: A generalized mood condition that occurs without an identifiable triggering stimulus.
Cognitive Emotional Theory: Cognitive activity in the form of judgments, evaluations, or thoughts is necessary in order for an emotion to occur.
Darwinian Emotional Theory: Emotions evolved via natural selection and therefore have cross-culturally universal counterparts.
Fear: An emotional response to a perceived threat.
James-Lange Emotional Theory: Emotional experience is largely due to the experience of bodily changes.
Psychoacoustics: The study of subjective human perception of sounds.
Sound Design: The manipulation of audio elements to achieve a desired effect.
Sound Principles: Components that most influence how sound is perceived.
Source: The object emitting sound.
Suspense: A feeling of fear and anxiety about the outcome of certain actions.
Timing: The degree of synchronization between the sound effect and visible object(s).
Volume: The amplitude or loudness of a sound.
Chapter 10
Listening to Fear:
A Study of Sound in Horror Computer Games
Guillaume Roux-Girard
University of Montréal, Canada
ABSTRACT
This chapter aims to explain how sound in horror computer games works towards eliciting emotions in the gamer: namely fear and dread. More than just analyzing how the gamer produces meaning with horror game sound in relation to its overarching generic context, it will look at how the inner relations of the sonic structure of the game and the different functions of computer game sound are manipulated to create the horrific strategies of the games. This chapter will also provide theoretical background on sound, gameplay, and the reception of computer games to support my argument.
DOI: 10.4018/978-1-61692-828-5.ch010
INTRODUCTION
Computer game sound is as crucial to the creation of the depicted gameworld’s mood as it is in its undeniable support to gameplay. In horror computer games, this role is increased tenfold as sound becomes the engine of the gamer’s immersion within the horrific universe. From the morphology of the sound event to its audio-visual and videoludic staging, sound cues provide most of the information necessary for the gamer’s progression in the game and, simultaneously, supply a range of emotions from simple surprise to the
most intense terror. In horror computer games, the gamer is ill-advised to divert their attention from the various sound events, as careful listening will allow for, or at least favour, the survival of their player character. In his thesis on the sound ecology of the first-person shooter, Mark Grimshaw (2008) underlined that in everyday life, where dangers are limited, the auditory system “can operate in standby mode (or, in cognitive terminology, [the] auditory system is operating at a low level of perceptual readiness) awaiting more urgent signals as categorized by experience” (p. 10). Just as Grimshaw did for the genre at the heart of his study, I suggest that “the hostile world of the [horror computer] game
requires a high level of perceptual readiness in regard to sound” (p. 10). The level of attention required vis-à-vis sound must be increased all the more because computer game environments are often designed to limit the visual perception of the gamer. Whether by means of a constraining virtual camera system (Taylor, 2005), by using a stylistic effect such as the thick fog shrouding the streets of Silent Hill (Konami, 1999), or by drastically reducing sources of light, game designers have, over time, found a variety of ways to force the gamer to use their ears in order to help their player character survive in the nightmarish worlds in which they play.

To fully comprehend how horror computer games manage to frighten the gamer, one must understand how sound is structured, as well as be aware of how the gamer makes meaning with the information the sounds carry. From this point, many questions arise. What are the implications of the generic context on the reception of the sounds in horror computer games? On what basis should we approach the sound structure of those games? How does this structure allow for the mise en scène of the dreadful elements or horrific strategies of the games? What are the basic functions of horror computer game sounds and, once again, how can the game work on these functions to create a sentiment of fear and dread in the gamer?

As will be further explored in the next sections of this chapter, I hypothesize that sound in computer games should be approached directly in regard to its purposes towards gameplay. After all, gameplay is what mainly distinguishes computer games from their linear audio-visual counterparts: the main difference between computer games and films lies in the participatory and interactive nature of the videoludic medium. Therefore, it is mainly through a study of gameplay that true understanding of the role of game sound can be achieved. 
In this perspective, I also suggest that sound should be addressed in a way that is accessible both to designers and to the average gamer. In order to do so, I firmly
believe that adopting a position that emphasizes reception issues of gameplay can provide a more productive model than one that would be grounded directly in the production aspects (implementation and programming) of game sound. Overall, this text aims at explaining how horror game sound works in a way to elicit specific emotions in the gamer. Adopting a gamer- and gameplay-centric perspective, it wishes to highlight how the inner relations of the sonic structure and the different functions of game sound are used to create strategies based on the micro events and on the overarching generic context that regulates these events. With examples borrowed from the Alone in the Dark (I-motion, 1992-1995, Infogrames, 2001 & Atari, 2008), Resident Evil (Capcom, 1996-2009) and Silent Hill (Konami, 1999-2008) series, and from the computer game Dead Space (Electronic Arts, 2008), this paper will also try to demonstrate how the notion of genre, instead of being merely a tool to classify games, rather impacts on the expectations of the gamer and therefore structures the way they organize and make meaning of sound in relation to the game context.1
APPROACHING HORROR COMPUTER GAME SOUND

Before we try to understand what purposes sounds serve in horror computer games and how they contribute to generating fear, it is essential to take a look at the numerous factors which condition the gamer’s journey and influence their listening throughout their gaming sessions.
The Horizon of Expectations

In her book Game Sound: An Introduction to the History, Theory, and Practice of Video Game Music and Sound Design, Karen Collins (2008) noted that “game [sound] has been significantly affected by the nature of technology […] and by the nature of the industry”
(p. 123). Indeed, economic and technological constraints are greatly responsible for the game’s aesthetic, as the limits imposed by production time and hardware often force some designers to lessen the richness of the soundscape while encouraging others to find inventive ways to overcome these constraints.2 However, as Collins explained, the games themselves also affect game sound by means of their genre, narrative structure, and participatory nature. She pointed out that “[g]enre in games is particularly important in that it helps to set the audience’s expectations by providing a framework for understanding the rules of gameplay” (Collins, 2008, p. 123). Consequently, the horizon of expectations gamers have of the games is probably the first thing that will influence the production of meaning towards a sound. As Hans Robert Jauss (1982) explains:

The analysis of the literary experience of the reader [or the videoludic experience of the gamer] avoids the threatening pitfalls of psychology if it describes the reception and the influence of a work within the objectifiable system of expectations that arise for each work in the historical moment of its appearance, from a pre-understanding of the genre, from the form and themes of already familiar works, and from the opposition between poetics and practical language. (p. 22)

This horizon of expectations will thus be forged by the gamer’s previous experiences at playing computer games, particularly those in the horror genre, but also by their familiarity with broader horror mythology and conventions such as the ones found in movies and novels. We can also maintain that the notion of genre will play a determining role in the way game sound is produced and then received by the gaming community. This relationship between production and reception is fundamental to understanding the functions and evolution of sound in horror computer games. 
Indeed, these games are both generically marked, in that they “rely on generic identification by an audience” (Neale, 2000, p. 28), and generically modelled, in that each “draws on and conforms to existing generic traditions, conventions and formulae” (Neale, 2000, p. 28).3 To be considered as a horror game, a videoludic work must then be designed with the purpose of scaring the gamer and must be received as such by the gaming community, which will then treat this intention as a gaming constraint. Accordingly, sound must be exploited to support these design choices and, to a certain degree, correspond to the expectations the games produce.
What is a Horror Computer Game?

Horror computer games generate fear through mechanisms specifically tied to their videoludic nature, even though they often draw their strategies of mise en scène from the conventions and mythologies of their cinematographic counterparts (Whalen, 2004; Perron, 2004). Derived from the “adventure” genre (Whalen, 2004), these computer games exploit horror conventions on the plot level, often by opposing a lone individual, trapped inside a gloomy location, to a flock of bloodthirsty, monstrous creatures which they must confront, or sometimes run from, in order to survive. On the gameplay level, “the gamer has to find clues, gather objects [...] and solves puzzles” (Perron, 2004, p. 133). As mentioned previously, because these games normally limit vision through their formal and aesthetic treatments, sound plays a determining role in helping the gamer to gather the information on their environment necessary to stay alive.

Horror computer games are not only designed to generate fear based on their narrative setting or the iconography they employ, but are also conceptualised to produce what Bernard Perron called gameplay emotions. According to Perron (2004), these games engender three different kinds of emotions: (1) fictional emotions which “are rooted in the fictional world and the concerns addressed by that world”, (2) artefact emotion emanating “from concerns related to the artefact, as well as
stimulus characteristics based on these concerns”, but mostly (3) gameplay emotions “that arise from the gamer’s action in the game-world and the consequent reactions of this world” (p. 132). While all horror computer games are provided with a more or less elaborate fictional setting, in the end, it remains a part of the experience of gameplay. For horror games to be effective, gameplay mechanics must have been designed with the intent of scaring the gamer, by limiting the quantity of ammunition, for instance.
How to Approach Horror Computer Game Sound?

In the introduction to Sound Theory, Sound Practice, Rick Altman (1992) claimed that rather than being seen as a self-centered text, cinema should be perceived as an event. Traditionally, film studies modelled the production and reception as gravitating around the film-as-text. However, as Altman explained: “Viewed as a macro-event, cinema is still seen as centred on the individual film, but […] the textual center is no longer the focal point of a series of concentric rings” (pp. 2-3). Following this model, the film-as-text mostly serves as a “point of interchange” between the process of production and the process of reception which mutually influence one another. The film itself thus becomes a representation of this “dialogue” or this “event”.

Computer games can be envisioned in a similar fashion. However, while technical aspects of computer game production might enlighten certain points about how the sounds are implemented and structured within the game code, I believe that it is not with regard to this code that horror computer games should be approached. While some PC games offer the option to look at how the files are organised on the disc, most computer games, particularly console games, do not. I will therefore be addressing sounds through the gameplay process of the individuals playing the game and all design matters will be dealt with in regard to creating this gameplay experience.
For this matter, the notion of genre will mostly serve as an overarching catalyst through which the gamer structures their journey in the games.4 Of course, the chapter will still deal with design issues such as the implementation of sound strategies; however, this will be done in order to investigate how designers built these stratagems out of their predictions of how the gamer potentially produces meaning through the sounds, in regard to generic constraints, during their gameplay activity.

But then again, what is gameplay? In Half-Real, Jesper Juul (2005) approached the concept of gameplay using Richard Rouse’s definition as a basis: “A game’s gameplay is the degree and nature of the interactivity that game includes, i.e., how the [gamer] is able to interact with the game-world and how that game-world reacts to the choices the [gamer] makes” (Rouse in Juul, p. 87). To further elaborate on the question of gameplay, and to prevent a misunderstanding of the term, Juul added that “gameplay is not a mirror of the rules of a game, but a consequence of the game rules and the dispositions of the game players” (p. 88). Using this quotation as a starting point, and as a way to oppose the fallacy constructed by Manovich’s (2001) definition of an algorithm, Arsenault and Perron (2009) reminded us that “one of the misconceptions of gameplay which needs to be addressed springs out when one does not make a distinction between the process of playing games and the game system itself” (p. 110). Following their logic, gameplay must not be understood as “the” game system but as the “ludic experience” emerging from the relation that is established between the gamer and the game system. 
Therefore, it is important to understand that through the eyes of a gamer, the experience of gameplay is not portrayed by a series of codes managed by an algorithm, nor a direct representation of the implementation of sound within this code.5 Consequently, I chose to exploit a terminology which facilitates the understanding of the gamer’s cognitive process during
gameplay. It will allow me to better illustrate how the gamer produces meaning of sounds as a means of completing their main objective: the survival of their player character.

Arsenault and Perron (2009) defined computer games as “a chain of reactions” in which “[t]he [gamer] does not act so much as he reacts to what the game presents to him, and similarly, the game reacts to his input” (pp. 119-120). In other words, the gamer responds to events that were programmed by a designer (whose job partly consisted of predicting the gamer’s reactions to the proposed events), and then the game acts in response to the gamer’s input with other preprogrammed events fitting the new parameters. According to their gameplay- and gamer-centric model, the authors explained (a single loop of) gameplay through four steps in which “the game always gets the first turn to speak” (Arsenault & Perron, 2009):

• From the game’s database, the game’s algorithm draws the 3-D objects and textures, and plays animations, sound files, and finds everything else that it needs to represent the game state.
• The game outputs these to the screen, speakers, or other peripherals. The gamer uses his perceptual skills (bottom-up) to see, hear and/or feel what is happening.
• The gamer analyses the data at hand through his broader anterior knowledge (in top-down fashion) of narrative convention, generic competence, gaming repertoire, etc. to make a decision.
• The gamer uses his implementation skills (such as hand-eye coordination) to react to the game event, and the game recognizes this input and factors it into the change of the game state. (pp. 120-121)
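For readers who think more readily in code, the single loop quoted above can be given a rough procedural shape. What follows is only an illustrative Python sketch and not Arsenault and Perron's own formalisation: the GameState class, the cue names, and the genre_knowledge mapping are hypothetical, introduced here solely to make the four steps concrete.

```python
from dataclasses import dataclass, field


@dataclass
class GameState:
    """Hypothetical, minimal game state (an assumption of this sketch)."""
    enemy_nearby: bool = True
    actions_taken: list = field(default_factory=list)


def select_cues(state):
    # Step 1: the game's algorithm picks what it needs to represent the state.
    cues = ["ambient_rumble"]
    if state.enemy_nearby:
        cues.append("offscreen_moan")
    return cues


def perceive(cues):
    # Step 2: the game outputs the cues; the gamer hears them (bottom-up).
    return set(cues)


def decide(heard, genre_knowledge):
    # Step 3: the gamer interprets the cues through prior knowledge (top-down).
    for cue in sorted(heard):
        if cue in genre_knowledge:
            return genre_knowledge[cue]
    return "advance"


def apply_input(state, action):
    # Step 4: the game registers the input and updates the game state.
    state.actions_taken.append(action)
    if action == "retreat":
        state.enemy_nearby = False
    return state


def run_loops(state, genre_knowledge, turns):
    # Repeating the single loop stands in, very loosely, for repeated play.
    for _ in range(turns):
        heard = perceive(select_cues(state))
        action = decide(heard, genre_knowledge)
        state = apply_input(state, action)
    return state


# A gamer whose generic competence associates off-screen moans with danger.
final = run_loops(GameState(), {"offscreen_moan": "retreat"}, turns=2)
print(final.actions_taken)
```

The sketch deliberately keeps the gamer's "anterior knowledge" as a plain lookup table; in the model discussed here that knowledge is far richer, and is itself reshaped by each pass through the loop.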
However, as the authors recalled, “the most obvious flaw of representing gameplay with a single circle is that the temporal progression—the
evolution of the gamer’s relationship with the game—is left aside” (Arsenault & Perron, 2009, p. 115). To correct this failing, Arsenault and Perron proposed a model—the Magic Cycle (Figure 1)—that is based on 3 interconnected spirals: the heuristic spiral of gameplay, the heuristic spiral of narrative and the hermeneutic spiral. They also clarified that: “[t]he relationship to each other is one of inclusion: the gameplay leads to the unfolding of the narrative, and together the gameplay and the narrative can make possible some sort of interpretation” (p. 118). Their model also took into account the gamer’s experience in gaming and the horizon of expectations of the gamer that are shaped by their previous knowledge of the game or sometimes by an introductory cut scene. While looking at the model, these are respectively represented by the dotted lines entitled “launch window” and by the inverted spiral. From this point the looping process described above will be “repeated countless numbers of time to make up the magic cycle” (p. 121) and to represent the mental image the gamer develops about the game (represented by the Game′ of the model). This perpetual process, alongside the implication of the generic context, will therefore allow for the mental organisation of sounds towards the gamer activity inside the game.
STRUCTURING HORROR COMPUTER GAME SOUND

When engaged in a horror game, the gamer is required by the exercise of gameplay to organise sounds according to their gaming objectives which, in the case of the genre we study here, mainly revolve around allowing their player character to survive the horrors of the game. In order to do so, the gamer tries to answer two basic questions regarding game sound: (1) where does that sound originate? and (2) what is the cause of that sound? I therefore propose to explore a basic sound structure that will effectively represent the
Figure 1. Arsenault & Perron’s Magic Cycle. (© 2009, Arsenault and Perron. Used with permission)
cognitive process (as previously explained with Arsenault and Perron’s model) that is performed almost unconsciously by the gamer while playing a horror computer game.
Inside and Outside of the “Diégèse”

While glancing at the game sound literature (Collins, 2008; Grimshaw, 2008; Huiberts and van Tol, 2008; Jørgensen, 2006, 2011; Stockburger, 2003), we notice that one of the most common ways to envision the structure and composition of sound in games is relative to its status regarding the diégèse of the game (I am using the French word to avoid any misconception that this term holds the same meaning as Plato’s and Aristotle’s definition of diegesis6). Taking its origin in film studies, the diégèse must be understood as a “mental reconstruction of a world” (Odin, 2000, p. 18, freely translated) that can be “perceived as an inhabitable space” (Odin, 2000, p. 23, freely translated). This definition of diégèse clearly refers to the “historico-temporal” universe in which the story (or, in the case that interests us, the simulation) takes place. This definition thus allows more easily for the parallel that is often established between the diégèse and the gameworld.7

From a structural perspective, more than the description of a world, it is particularly in the division that exists between elements considered as being part of the fictional world (diegetic) and elements which are not judged to be components of the fictional world (extra-diegetic8) that this notion has found a niche in works on sound in game studies. Indeed, while listening to horror computer game sound, whether or not a sound is part of the depicted gameworld will have a considerable impact on the decisions the gamer will make regarding this sound. Based on the gameplay model that was introduced earlier, these sound cues will engender many questions in an attempt to recreate the mental image of the game state. Is the sound produced by an instance present in the “diégèse”? If it is, does that instance represent a threat to the player character or is it just a part of the ambience of the gameworld? Furthermore, as hinted by this set of queries, while the diegetic status of a sound holds much importance, recreating the mental image of the game state necessitates a more elaborate set of qualifiers.
Sound Generators

In computer games, much attention must be paid to sound sources as they contribute to the construction of the diegetic space. However, more important than what instance or event emits the sound is what generates the sound. Not only does the notion of generator furnish knowledge on what caused a specific cue, but it also provides information on its relationship to other sounds, its relationship to the game state, as well as the situation in which it is heard. These sound generators, as Kristine Jørgensen (2008) explained, are “not the same as the source of the sounds. While the source is the object that physically (or virtually) produces the sound: the generator is what causes the event that produces the sound” (Player Interpretation of Audio in Context section, para. 2). If we adapt Jørgensen’s example to a horror computer game context, this basically means that the shrieking sound emitted by one of Dead Space’s necromorphs (its source) while being dismembered by the player character’s plasma cutter is in fact generated by the gamer. Therefore, this concept (in its definition) also reflects the interactive nature of computer games by putting forward the agency9 of the gamer within the simulated world, as well as the response of the game to the gamer’s actions.

While studying World of Warcraft, Jørgensen (2008) identified five categories of sound generators: the gamer, allies, enemies, the gameworld, and the game system, each of which is organized according to the perspective of the gamer. Even though some horror games propose an interaction with friendly non-player characters such as Luis in Resident Evil 4 (Capcom, 2004) or, as in Resident Evil 5 (Capcom, 2009) and Left 4 Dead (Valve Software, 2008), offer a multiplayer co-operative mode, most games of the genre privilege the solitude of the player character and allies are normally quite scarce. 
Therefore, this chapter will focus on the dynamic and nondynamic sounds (Collins, 2008) produced by the gamer, the enemies, the gameworld, and the game
system. Accordingly, I will briefly describe these generators following Jørgensen’s definition and adapt them to my own corpus of study. General informative functions of each type of generator will also be mentioned as they will provide a tighter relationship with the next section of this chapter on the functions of horror computer game sounds. A sound generated by the gamer is “caused by [gamer] action” (Jørgensen, 2008, Player Generated Sound section, para. 1). As Jørgensen explained: The most important informative role of [gamer] generated sounds is to provide usability information, or more specifically to provide response since they always seem to appear immediately after a player action. Player generated sounds also provide spatial information, and sometimes also temporal and [player character] state information. (Player Generated Sound section, para. 1) In Resident Evil (Capcom, 1996), for instance, these sounds may include footsteps, gunshots, the opening of doors, angry monster growls after they are shot by the gamer, the opening of Chris Redfield’s or Jill Valentine’s inventory menus and so on. For their part, enemy generated cues “are produced externally from the [gamer’s] perspective, by being detached from the [gamer’s] own actions and emerging from the gameworld” (Jørgensen, 2008, Sound Generated By Enemies and Allies section, para. 1). Such sounds will furnish spatiotemporal information and will also serve “presence” purposes as they engage with the existence of enemies in the vicinity. Of course, these sounds also give information about modification in the game state and supply progression functions of the game: these might include the sounds of offscreen or on-screen monsters, or may indicate that the player character has been wounded after being hit by a zombie. Gameworld generated sounds are similar to what Huiberts and van Tol (2008) described as
zone sounds. These sound cues consist of sounds “linked to the environment in which the game is played” (Huiberts & van Tol, 2008, Zone section, para. 1). While these sounds are often implemented to generate the ambience of the game, they also serve spatial functions and might give certain information about the game state. In Dead Space, these sounds include the rumbling of the ship and some of the gruesome sounds emitted by the preprogrammed burst of blood coming out of organic matter that can be found on the wall and floor.

Game system-generated sounds are by far the most ambiguous. Jørgensen (2008) defined them as sounds “generated by the system to provide information that any [player character] cannot produce on its own, and carry information directly connected to game rules and as well as game and [gamer] state” (Conclusions and Summary section, para. 3). Horror computer games do not include many of those sounds. However, a few examples can be found. The “fuzzing” sound, accompanied by heart pounding, that is emitted when a player character is lethally wounded in Resident Evil 5 could correspond to this description: it is not directly produced by a gamer’s action but is generated by the system to warn the gamer that their player character needs immediate health assistance. While it is not explicitly mentioned by Jørgensen, I would argue that the extra-diegetic musical score of the game is also system generated. While this music often plays an affective role in the game, it also serves presence and game state purposes. For instance, in Alone in the Dark: Inferno (Atari, 2008), the music ramps up, signalling that enemies are nearby or attacking the player character. It is mostly according to the relationship between this extra-diegetic music, the gamer, and the gameworld that this category of generators will be examined in this paper. These generators will be used as a structural basis when studying the creation of horror game sound strategies.
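The distinction between a sound's source and its generator, together with its diegetic status, can be summarised as a small data structure. The Python sketch below is an illustration only and does not describe how any of the games cited implement their audio: the SoundCue class and its field names are hypothetical, and the example entries simply restate cues mentioned in this section.

```python
from dataclasses import dataclass
from enum import Enum


class Generator(Enum):
    # The five categories of generators identified by Jørgensen (2008).
    GAMER = "gamer"
    ALLY = "ally"
    ENEMY = "enemy"
    GAMEWORLD = "gameworld"
    GAME_SYSTEM = "game system"


@dataclass(frozen=True)
class SoundCue:
    """Hypothetical record: what emits a cue versus what causes it."""
    description: str
    source: str           # the object that (virtually) produces the sound
    generator: Generator  # what causes the event that produces the sound
    diegetic: bool        # whether the cue belongs to the depicted gameworld


CUES = [
    SoundCue("shriek of a dismembered necromorph", "necromorph",
             Generator.GAMER, True),
    SoundCue("off-screen moan of a zombie", "zombie",
             Generator.ENEMY, True),
    SoundCue("rumbling of the ship", "ship",
             Generator.GAMEWORLD, True),
    SoundCue("musical score ramping up", "score",
             Generator.GAME_SYSTEM, False),
]


def by_generator(cues, generator):
    # Group cues by what caused them rather than by what emitted them.
    return [cue.description for cue in cues if cue.generator is generator]


print(by_generator(CUES, Generator.GAMER))
```

Note that the necromorph's shriek is filed under the gamer, not the enemy: the monster is the source, but the gamer's attack is what generated the cue.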
THE FUNCTIONS OF HORROR COMPUTER GAME SOUND

To reach their objective, gamers must also gather information about the game state. To do so, they must ask themselves what the functions of a particular sonic cue are and, if the sound serves more than one purpose, which function is more important according to the context. In computer games, sounds contribute to the gamer’s immersion: they construct the mood of the game, and provide information that will be used in gameplay. According to Jørgensen (2006), we can state that sound serves two main functions. On one hand, it “has the overarching role of supporting a user system” and, on the other, it is “supporting the sense of presence in a fictional world” (p. 48). This basically means that sound creates “a situation where the usability information of elements such as [sound] becomes integrated with the sense of presence in the virtual world” (Jørgensen, 2008, Integration of Game System and Virtual World section, para. 1).
The (Double) Causality of Sound

To fulfil the important functions outlined by Jørgensen, I believe that sounds first need to create a feeling of causality (1) with the images (and, more broadly, with the gameworld) and (2) with the gamer’s actions. Just as in movies, images and sounds are tightly linked, producing the effect of added value, described by Michel Chion (2003) as a “sensory, informative, semantic, narrative, structural or expressive value that a sound heard during a scene leads us to project on the image, until creating the impression that we see in this image what in reality we ‘audio-see’” (p. 436, freely translated). The added value of a sound on the images creates what Chion called audio-visiogenic effects which can be classified within four categories: (1) effect of sense, atmosphere, content, (2) rendering and matter effect (materializing sound indices)
which creates sensations of energy, textures, speed, volume, temperature, for example, (3) scenography effect concerning the creation of an imaginary space and (4) effect related to time and the construction of a temporal phrasing. These audio-visiogenic effects and materializing sound indices are essential to horror computer games such as Dead Space, as they give an organic texture to an anthropomorphic monster. The gooey sound that accompanies the impact of a plasma cutter blast as blood and guts explode on the screen helps the gamer believe that what they are seeing is real, while in fact what is shown on the screen is a simple translation of coloured polygons.

The effectiveness of the added value rests upon three factors that have also been defined by Chion: synchronisation points, each “a more salient moment of a synchronised reunion between concomitant sonic moment and visual moment” (p. 433, freely translated); more broadly, an effect of synchresis; and an effect of rendering, which gives the sound the necessary degree of veridicality (Grimshaw, 2008) for it to seem “real, efficient and adapted” to “recreate the sensation [...] associated to the cause or to the circumstance evoked in the [game]” (Chion, 1990, p. 94, freely translated). For this to be effective, Grimshaw (2008) reminds us that a sound “must be as faithful as possible to its sound source [within the game], containing and retaining, from recording or synthesis through to playback, all the information required for the player to accurately perceive the cause and, therefore, the significance of the sound” (p. 73).

However, we must not forget that computer games are not only audio-visual, but also interactive. Therefore, sound must also establish a sentiment of causality between the gamer’s actions, which mostly correspond to the handling of joysticks and the pressing of buttons on their controller, and the action performed by the player character on the diegetic level. 
For this matter synchronisation points turn out to be less aesthetic and more pragmatic as they become the product of the gamer’s
will in act. This relationship between action and sound is essential in establishing horror game conventions and greatly contributes to the effect of presence, as it gives sensory support to the gamer’s agency.
Gameplay Functions

From a gameplay point of view, and following the loop of Arsenault and Perron’s (2009) model, sound performs two main functions: (1) to give information on the game state and (2) to give feedback on the gamer’s activity in response to the game state. Before we engage in a typology of the different gameplay functions of sounds, I wish to mention that I am fully aware that every sound, while serving gameplay purposes, simultaneously has immersive and affective functions. However, for reasons of brevity, I will not integrate those functional poles together right away. In this line of thought, I will not present an exhaustive list of gameplay functions, and will keep only those useful for my analysis of horror computer game sound strategies.10 Based on Collins’s (2008), Grimshaw’s (2008), Jørgensen’s (2008) and Whalen’s (2004) work, I wish to take a look at five gameplay functions that some horror game strategies are founded upon: spatial functions, temporal functions, preparatory functions, identification functions, and progression functions.

In computer games, it is essential to determine the approximate location of the sound generators. Spatial functions allow for the localization of generators in terms of direction and distance, contribute to the quantification and qualification of game space and help the gamer to navigate through it. More precisely, the sounds will be described as choraplasts, which are sounds “whose function is to contribute to the perception of resonating space [volume and time, localization]” (Grimshaw, 2008, p. 113). When the gamer privileges a “navigational” mode of listening (Grimshaw, 2008, p. 32), the increase or decrease of a sound’s intensity might, for instance, assist the gamer in localizing
the generators and help them decide whether or not they want to advance in their direction.

Sonic temporal functions are also very important to horror computer games. For example, in Resident Evil 5, the flamethrower and satellite laser-guide that the gamer needs to use in order to kill the dangerous Uroboros monsters regularly need, respectively, to be refilled with fuel or to regain energy. To signal that the weapons are recharging, in addition to a visual indicator, the game underlines this process with a distinctive sound. Similarly, when the replenishing is done, a tone will inform the gamer. The same assumptions can be applied to other weapons as reload times and rate of fire are sometimes vital to the survival of the player character. Sounds that are “affording the perception of time passing” are named chronoplasts by Grimshaw (2008, p. 113).

The preparatory functions, a term I have borrowed from Collins (2008), and which correspond to what Jørgensen (2006; 2008) called urgency functions, are served by sounds alerting the gamer that an event has occurred in the diegetic world or forewarning them of the presence of an enemy within the immediate environment of the player character. For instance, in Dead Space, the alarm signalling that a section of the ship is being put into quarantine serves as an alert, while the off-screen moans of zombies in Resident Evil are considered a forewarning. It must also be acknowledged that adaptive and interactive (Collins, 2008) extra-diegetic music can also occupy these roles as they either punctuate an event or, as in Resident Evil 4, testify to the presence of infected Ganados.

For their part, identifying functions, which were more accurately theorised by Jørgensen (2006), correspond to the ability of a sound “to identify objects and to imply an objects value” (Identifying Functions section, para. 1). 
For example, the heavy footsteps and the characteristic music loop accompanying the presence of Nemesis in Resident Evil 3 (Capcom, 1999), as well as the screeching of Pyramid Head’s gigantic blade in Silent Hill 2 (Konami, 2001), lead to a quick identification while at the same time providing these characters with an imposing and threatening demeanour. The use of identifying functions is not limited to distinguishing and qualifying enemies; it also “has a central role related to changes in game state and player state”11 (Jørgensen, 2008, The Role of Audio in a Gameplay Context section, para. 2). In Dead Space, when Isaac Clarke grunts in pain after taking a hit, it signals to the gamer that the player character’s physiological integrity has been affected. Musical loops can also signify transitions in the game state. In the Resident Evil series, the leitmotif associated with the “save room” means that the player character is safe, while fast-paced music normally implies the presence of a threat or a situation that requires immediate attention from the gamer.
Progression functions is a term I propose based on my reflections upon the motivational purpose of music proposed by Zach Whalen (2004) in his text Play Along: An Approach to Videogame Music. As Whalen explained, in Silent Hill, “the music is always in a degree of “danger state” in order to impel the player through the game’s spaces. The mood of the game is crucial to the horrific ‘feel’, but it also provides motivation by compelling continual progress through the game” (Silent Hill section, para. 1). I suggest that other sounds, such as the enemies’ sound cues or alarm sounds, can achieve a similar purpose and encourage (or sometimes discourage) the gamer to progress through the game. While these functions are mostly integrated in enemy-generated sounds, some segments of dialogue can also be considered as serving progression functions. For instance, in Dead Space, radio communications with Kendra and Hammond help the gamer to figure out how to reinitialise the ventilation system of the hydroponic station of the U.S.S. Ishimura.
Of course, one single sound event can serve many of these functions simultaneously. 
Furthermore, as Jørgensen (2008) specified, “the functional roles of sounds [will be] judged with
different urgency in different situations even though the sound is exactly the same” (Player Interpretation of Audio in Context section, para. 1). While this quote was intended to portray the relationship existing between sound and context in multiplayer sessions of World of Warcraft (Blizzard Entertainment, 2004), it is, nevertheless, quite applicable to the single player games which characterize most of the horror computer game genre. It is in regard to the macro and micro contexts of the games that prioritisation of one function over another will be possible. With all this in mind, it is now time to take a look at how horror games partly build their sound strategies by playing with these functions.
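The typology above, together with Jørgensen’s remark that the same sound is judged with different urgency in different situations, can be sketched as a small data structure. The following is a purely illustrative sketch and is not drawn from any of the games or authors cited: the class names, the two contexts ("exploration" and "combat") and the urgency weights are assumptions introduced here for the example.

```python
from dataclasses import dataclass
from enum import Enum


class GameplayFunction(Enum):
    """The five gameplay functions discussed in this section."""
    SPATIAL = "spatial"
    TEMPORAL = "temporal"
    PREPARATORY = "preparatory"
    IDENTIFYING = "identifying"
    PROGRESSION = "progression"


@dataclass(frozen=True)
class SoundEvent:
    """A single sound event, which may serve several functions at once."""
    name: str
    functions: frozenset


# Hypothetical urgency weights per context (assumed values, for illustration).
URGENCY = {
    "exploration": {
        GameplayFunction.PREPARATORY: 3,
        GameplayFunction.SPATIAL: 2,
        GameplayFunction.IDENTIFYING: 1,
    },
    "combat": {
        GameplayFunction.SPATIAL: 3,
        GameplayFunction.IDENTIFYING: 2,
        GameplayFunction.PREPARATORY: 1,
    },
}


def prioritise(event, context):
    """Rank the functions of one sound by their urgency in a given context."""
    weights = URGENCY[context]
    # Highest weight first; ties are broken alphabetically to stay deterministic.
    return sorted(event.functions, key=lambda f: (-weights.get(f, 0), f.value))


moan = SoundEvent(
    "off-screen zombie moan",
    frozenset({
        GameplayFunction.PREPARATORY,
        GameplayFunction.SPATIAL,
        GameplayFunction.IDENTIFYING,
    }),
)
```

With these assumed weights, the same moan is treated first as a forewarning while exploring and first as a localisation cue during combat, which is one way of picturing how context prioritises one function over another.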
HORROR COMPUTER GAMES’ SOUND STRATEGIES
Horror computer games have been around for a long time. During the 1980s, many games such as Atari’s Haunted House (1981), Sweet Home (Capcom, 1989)12 and the videoludic adaptations of the movies Halloween (Wizard Video Games, 1983) and Friday the 13th (LJN, 1989) hit the shelves to satisfy gamers in quest of an adrenalin rush. However, as I explained in a chapter published in Horror Video Games: Essays on the Fusion of Fear and Play, the abstract graphics and synthesised sounds of those games could not provide a simulation of evisceration as convincing as certain computer games can provide today. Indeed, “at that time, the horror was more lurking in the paratextual material than the games themselves” (Roux-Girard, 2009, p. 147). As Mark J. P. Wolf (2003) explained:
The boxes and advertising were eager to help players imagine that there was more to the games than there actually was, and actively worked to counter and deny the degree of abstraction that was still present in the games. Inside the box, game instruction manuals also attempted to add
exciting narrative contexts to the games, no matter how far-fetched they were. (p. 59)
As Remi Delekta and Win Sical (2003) suggest in an article in the only issue of Horror Games Magazine: “[Horror computer games] can not exist without a minimum of technical capacities: sounds, graphics, processing speed. Fear to exist needs to be staged and mise en scène needs means” (p. 13, freely translated). It was in 1992 that Alone in the Dark, designed by Frédérick Raynal, shook the entire videoludic scene by incorporating polygonal characters, monsters and objects in two-dimensional, pre-rendered backgrounds. While this simulated three-dimensionality opened a new “game space” allowing for novel possibilities in gameplay, it also created an innovative “playground” for imaginative sound designers.
Between Horror and Terror
Before we begin our analysis of horror games’ sound strategies, I need to clarify that fear, terror, dread, horror, anxiety and disgust, while they are broadly analogous emotions, are not synonymous. Moreover, not all horror computer games try to generate this entire emotional spectrum13. Accordingly, while some games rely on visceral manifestations of fear such as horror and disgust, others create fear at a psychological level, generating suspense, terror and dread. To understand how games manage to scare gamers, we must first take a look at the difference between horror and terror. According to Perron14 (2004),
horror is compared to an almost physical loathing and its cause is always external, perceptible, comprehensible, measurable, and apparently material. Terror, as for it, is rather identified with the more imaginative and subtle anticipatory dread. It relies more on the unease of the unseen. (p. 133)
Of course, sound design plays a prominent role in setting these two poles up. On one hand,
sounds provoke spontaneous sensations using rendering effects of matter, and on the other, they contribute to the elevation of suspense by creating ambiguity between causes, uncertainty regarding the origin of the sounds and by limiting the information carried by the sound’s affordances. To achieve this, horror computer games rely on a plurality of strategies. In the preceding sections of this chapter, I introduced a number of theoretical tools to help us understand how gamers structure sounds within and without the gameworld and how they produce meaning with the different cues they listen to. I now propose to revisit those concepts in light of a horrific mise en scène to comprehend how horror games develop those strategies.
The Choice of the Sounds
While horror computer games (and mostly survival horror games) utilize a wide range of sound strategies, the staging of fear starts at a purely formal level. The choice of sounds and the way they are used are greatly responsible for the quality of the mood of the games. Some empirical research (quoted in Grimshaw, 2009) attempted to demonstrate that there is a certain degree of correlation between the physical signal of a sound and the emotions felt by listeners. For instance, Cho, Yi, and Cho’s (2001) research on textile sounds shows that loud and high-pitched sounds are unpleasant to the ear, while Halpern, Blake, and Hillenbrand (1986) point to loud, low-mid frequencies as being disagreeable. Although these findings seem contradictory, they nevertheless tend to reveal that the acoustic qualities of sounds can have, amongst other factors, a physiological as well as psychological impact on the gamer. However, to arouse emotions, we need much more than mere frequencies. Borrowing from Pierre Schaeffer’s theory (synthesised by Chion, 1983) on the morphological description of sounds and the quatre écoutes (écouter, ouïr, entendre, comprendre), it is mostly the work performed on
the allure, grain, dynamic profile, and the mass profile of a sound that determines its repercussion on the gamer. During their gameplay activity, the gamer hears (entendre) the morphological qualities of the sounds which allow them to comprehend (comprendre) and experience them as frightening. Therefore, it is not only because the gamer listens (écouter) to what they can identify as a zombie that they are scared, but because they hear (entendre) a moan or a growl, which corresponds to the sound motifs contained in their knowledge of horror symbols. In other words, it is not so much because the lamentation is generated by a zombie and comprises low frequencies that it is frightening but because, in its essence, it contains an energy reminiscent of a certain form of pain and agony. Ambiences can have a similar effect as they associate acoustic qualities with unpleasant situations and frightening locales. Reciprocally, the emotions produced by these choices of sound force the gamer to focus on every little detail of the sound design and are partly responsible for the gamer’s high level of “perceptual readiness”.
Of course, the selection of sounds must also aim to create uncertainty as this feeling is essential to the creation of suspense. To do so, designers sometimes have to confound the gamer’s expectations to a certain point. In his book on the Silent Hill series, Perron (2006) observed an evolution, from one title to another, in the sound used to portray the monstrous nurses. As the author explains: “The nurses, which have a much low-pitched ‘voice’ in [Silent Hill], have a penetrating sped up respiration in [Silent Hill 3]” (Perron, 2006, p. 93, translated by the author). In my view, this purely aesthetic strategy has the effect of reducing the gamer’s “launch window” into the game, preventing him from using his prior knowledge to identify (identifying functions) his opponents. 
Consequently, Silent Hill 3’s sound design created ambiguity regarding the cause of the sound, and forced the gamer to reconstruct, from game to game, the relation between the sounds and their generators.
However, as we have seen earlier, horror games do not create fear only with their aesthetic dimension, but also with their narrative structure and gameplay. Therefore, some of their strategies are also constructed from these two dimensions.
Creation of a Startle Effect
Sound plays a preponderant role in the creation of a variety of surprise effects. Following an analysis of this phenomenon by Robert Baird, Perron (2004) explained in his text Sign of a Threat: The Effects of Warning Systems in Survival Horror Games, that the essential formula for creating a startle effect can be summed up into three steps: “(1) a character presence, (2) an implied offscreen threat, and (3) a disturbing intrusion [often accentuated by a sound burst] into the character’s immediate space” (p. 133). As noted by the author, it is indeed at the moment of the intrusion of the off-screen threat inside the screen that sound will take on all its importance. At this level, it is a question of contrast in the sonic intensity and synchronisation of the sound and its generator in the visual field of the gamer. Therefore, startle effects depend on the physical limitations of the ears. As ears are slower to react than the eyes, the startle effect will temporarily cloud the gamer’s evaluation and identification operations. To favour such effects, horror games often rely on a refined sound aesthetic and create moments of approximate silence. We can also say that the sounds the gamer cannot hear (the noises an enemy should make while moving towards the gamer that are rendered inaudible) play a role as important as the ones he can hear. In Schaefferian terms, we could say the game plays on the limits of hearing (ouïr) as a way to fool the gamer’s listening (écouter). It is in light of these considerations that the episodes of respite before an attack play a determining role in the staging of a startle effect. This is stressed by Whalen (2004):
As it is the case with horror films, the silence [...] puts the player on edge rather than reassuring him that there is no danger in the immediate environment, increasing the expectation that danger will soon appear. The appearance of the danger is, therefore, heightened in intensity by way of its sudden intrusion into silence. (Silent Hill section, para. 3) It is according to this technique that designers punctuated, by shattering a window, the intrusion of a long-fanged monster in Alone in the Dark (I-Motion, 1992), or in a similar incursion of a zombie-dog in Resident Evil (Capcom, 2002), intensified the attack of a crawling monster in Silent Hill 2 (Konami, 2001), or amplified the brutal opening of an elevator door by a necromorph in Dead Space. As Perron (2004) mentioned: “To trigger sudden events is undoubtedly one of the basic techniques used to scare someone. However, because the effect is considered easy to achieve, it is often labelled as a cheap approach and compared with a more valued one: suspense” (p. 133). Following this line of thought, if sound plays a decisive role when it comes to making a gamer jump out of his shoes, it also plays a role in the creation of suspense. It is in this perspective, towards dread and anticipation, that the next strategies will be explored.
The Impact of Forewarning
Forewarning is one of the most effective strategies for creating suspense. Before further developing this concept, it is essential to mention that forewarning is not always exclusively based on sound. Forewarning, which consists of alerting the gamer to the presence of a menace in the surroundings of his player character, can also be based on visual cues, as is the case in Fatal Frame (Tecmo, 2002) when the indicator at the bottom of the screen turns orange, signalling the presence of a ghost. However, many forewarning
systems have been designed through sound. The most renowned and, incidentally, the most studied case of such a technique, discussed by Carr (2003), Kromand (2008), Perron (2004) and Whalen (2004), is the pocket radio in the Silent Hill series. This radio, which emits static when a threat is nearby, plays its role as a warning system perfectly. Forewarning can also be created in a more classical way, by making use of off-screen sounds (Perron, 2004). This is the case in Alone in the Dark: The New Nightmare (Infogrames, 2001) when, during the many seconds necessary for the gamer to go down the stairs leading to the interior court of the fort, it is possible to hear sounds associated with plant monsters coming from outside the frame of the fixed virtual camera shots. Although one might believe that such a warning, prefiguring the entrance of a gloomy monster inside the screen, would reduce the feeling of fear or uneasiness in the gamer, research cited in Perron’s (2004) work tends to prove the opposite. As the author himself specifies, “[…] simple forewarning is not a way to prevent intense emotional upset. It is worse than having no information about an upcoming event” (Perron, 2004, p. 135). Such a method creates terror by anticipation based on a fear of the unseen. However, what Perron fails to highlight is that forewarning does not rely only on the sound function of the same name. To be really effective, the forewarning must be unreliable and/or the quantity of information about the localisation of the generator must be limited. This precision offers the opportunity to introduce another strategy of horror computer games which relies on the functions of game sound: luring the gamer with sound.
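The idea that a forewarning is most effective when the information it carries is limited can be sketched in a few lines. What follows is a hypothetical illustration and not the actual implementation of the Silent Hill radio: the function name `static_level`, the detection radius and the number of steps are all assumptions. The static gets louder as a threat approaches, but only in a few coarse steps, and it says nothing about direction.

```python
import math


def static_level(distance, radius=20.0, steps=3):
    """Return a radio-static gain in [0, 1] for a threat at `distance`.

    Assumed behaviour, for illustration only: no static beyond `radius`,
    and inside it the gain is rounded up to one of `steps` coarse levels,
    so that quite different distances produce the same warning.
    `distance` is assumed to be non-negative.
    """
    if distance >= radius:
        return 0.0
    proximity = 1.0 - distance / radius  # in (0, 1], larger means closer
    return math.ceil(proximity * steps) / steps


# A threat at 19 units and one at 15 units yield the same static level:
# the gamer knows that something is near, but not how near.
far_threat = static_level(19.0)
nearer_threat = static_level(15.0)
```

The coarse quantisation is the design choice of interest here: the warning still fulfils its preparatory function, while the spatial function of the sound is deliberately impoverished.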
Luring the Gamer With Sound
In his master’s thesis, Serge Cardinal (1994) explained that “filmic writing favouring the emergence of a clear spatial structure will have the tendency to anchor the sound with its source,
will privilege without ambiguity the identification and localisation of the source with sound, will submit sound’s diffusion to the sound properties’ logic” (p. 53, freely translated). To create fear and strong feelings of discomfort, horror games reverse this concept, making the generators of the sounds harder to identify and localize. In the example from Alone in the Dark: The New Nightmare described earlier, the designers have avoided creating an evolution in the morphological properties of the sounds of the plant monsters in relation to the player character travelling through the fort’s space. This technique is used to alter the information the sound is carrying regarding the distance separating the threat and the gamer’s player character. While listening carefully, the gamer notices no variation in the dynamic profile and mass profile of the sound generated by the creatures of darkness even though the player character performs a descent which, if it were scaled, would be equivalent to a little less than a hundred meters. In this case, the designers intentionally reduce the quantity of information carried by the sound in a way that limits the gamer’s interpretation of space and time, as it is impossible to evaluate the distance between the player character and the monsters. However, this tweaking of the spatial and temporal functions of the sound allows for an emphasis to be put on its forewarning purpose, which is bound to influence the progression function of the sound. Preventing the easy localization of the source/generator of the sound has the effect of reinforcing the suspense established by the forewarning while simultaneously forcing the gamer to take a more prudent approach while going down the stairs.
Many horror game strategies rely on creating a certain level of ambiguity regarding the origin of sounds within the gameworld. 
While this can be achieved, as suggested by Daniel Kromand (2008), by blurring the frontier between the diegetic and non-diegetic parts of the game, similar exercises can be performed between instances within the diégèse. This partly explains why I chose to
structure my analysis of horror computer game sounds around the concept of sound generators and functions of game sound. Indeed, those notions are best suited to describing the relationship between the different instances of sound, in that there is more in horror computer games than meets the ear.
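The withholding of distance information described in the Alone in the Dark: The New Nightmare example can be pictured by contrasting two gain curves. This is a minimal sketch under stated assumptions: the inverse-distance law is a common convention in game audio and is not claimed to be what that game uses, and both function names and the constant level are hypothetical.

```python
def inverse_distance_gain(distance, reference=1.0):
    """Conventional attenuation: gain falls off as reference / distance.

    Inside the reference distance the gain is clamped to 1.0.
    """
    return reference / max(distance, reference)


def flat_gain(distance, level=0.5):
    """Gain that ignores distance entirely (assumed constant level).

    Loudness then carries no information about how far the generator is.
    """
    return level


# Top and bottom of a descent of roughly ninety metres (illustrative values).
top, bottom = 100.0, 10.0

# With attenuation, the monsters grow ten times louder during the descent.
attenuated = (inverse_distance_gain(top), inverse_distance_gain(bottom))

# With a flat gain, the sound is identical at both ends of the staircase.
withheld = (flat_gain(top), flat_gain(bottom))
```

In the first case the dynamic profile of the sound serves its spatial function; in the second, the gamer hears that the monsters exist but cannot evaluate the distance, which leaves only the forewarning.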
Ambiguity Between Sound Generators
A study of the relations that exist between the different categories of sound generators allows me to put forward some of the sonic strategies of horror computer games. One of the most basic strategies of those games is to design sound in a way that creates ambiguity between the different sound generators of the game. Indeed, if two or more generators manage to produce sounds of a similar nature, it will directly affect the cognitive process of the gamer, making it harder to localize the sources but also harder to classify the cues as more or less important regarding the game context. For the most part, these ambiguities will concern the spatio-temporal and preparatory functions of sound and will generate fear through anticipation.
The first technique consists of blurring the line between the sounds generated by the player and those generated by enemies. If these two generators manage to produce similar sound cues through a common source, it is possible to believe that, for example, the movements of the gamer’s player character through space might feed the suspense. I must admit that this technique is not widespread in horror computer games but, seeing as the game Dead Space manages to create such a doubt, it is worthy of being mentioned as a similar modus operandi might be exploited in future horror games. Indeed, in Dead Space, the sounds emitted by the player character’s footsteps on the viscous organic matter which often covers the floors of the spaceship are very similar to the sounds produced through the interaction of the substance and the deformed limbs of the grotesque monsters roaming with intent to kill the player
character. After hearing the monster’s footsteps for the first time, the gamer’s perceptual readiness will increase with regard to these sounds. However, since the sounds emitted by the gamer’s player character are so similar to and blend with those of the enemies, the movement of the player character on the gooey surface might signal a potential presence in the player character’s surrounding environment. The gamer will then be forced to adopt a more careful approach and look around more often than he would have normally done.
The flesh-covered sections of the spaceship also encouraged the designers to establish a similar relationship, much more common to horror computer games, between the sounds generated by the enemies and the gameworld. The game environments are often designed to generate ambiences that imitate the sounds generated by the threats of the games. As mentioned by Kromand (2008): “The [gamer]’s understanding of affordances can help to perform better [...] as certain sounds pass information regarding nearby opponents, but at the same time these exact affordances are mimicked by the ambiance.” (Welcome to Rapture section, para. 4). Once again, this way of conceptualizing sound in the game favours the creation of doubt in the player regarding the real provenance of the sounds. To get back to our Dead Space example, the organic matter is sometimes surmounted by excrescences which randomly squirt blood when the player character passes by. The excretion sound is also reminiscent of the sound made by enemies and tends to mislead the gamer as to what generated the sound. Similarly, other ambient sounds, such as the creaking of the ship’s hull, the rumbling of the machinery, and other metallic impacts are used to simulate the prowling of a necromorph in an air vent or in one of the ship’s corridors. Of course, Dead Space is not the only game that makes use of such strategies. 
As Ekman and Lankoski (2009) noted, in “Silent Hill 2 and Fatal Frame, the whole gameworld breathes with life, suggesting that somehow the environment itself is alive, sentient, and capable of taking action against the
player” (p. 193). This way of introducing “event sounds with no evident cause, sound not plausibly attributed to an inanimate environment” is, for that matter, the trademark of the Silent Hill series. This way of conceptualizing sound even extends to the atonal, extra-diegetic music of the game. This aesthetic choice allows me to introduce one last case of ambiguity between sound generators.
Some horror games aim at creating ambiguity between the game system, the gameworld, and the enemies, the emphasis being put, as suggested by Kromand (2008), on blurring the line between elements that are part of the diegesis and others that are not. By choosing to exploit atonal music, which is closer to musique concrète than traditional orchestral or popular music, and which blends and often merges with the ambient and dynamic sound effects of the game, designers manage to lure the gamer into thinking that there are more threats than there actually are. This technique also often succeeds at diverting the gamer’s attention from the real threats in the game. The most flagrant example of such a scrambling between the sounds emitted by enemies and the game system comes from Silent Hill. During a gameplay sequence in the alternate town of Silent Hill (Konami, 1999), the non-diegetic music, which is mostly constituted of metallic, industrial sounds, also includes in its loop a sound that is very similar to the sounds generated by the flying monsters of the game. Since the flying demons’ sounds are mixed very low within the music, the gamer, who is concentrating on his activity, probably won’t notice that this cue is repeated on a fixed temporal line and will be bound to associate this sound with an oncoming monster. A similar type of conception was also privileged in the sound design of Dead Space. As Don Veca, the lead sound designer of the game, underlined: “We […] approached the entire sound-scape as a single unit that would work together to create a dark and eerie vibe. [...] 
In this way, Dead Space has really blurred the line between music and sound design” (cited in Napolitano, 2008, First
Question section, para. 2). Therefore, as mentioned by Kromand (2008), “the constant guessing as to whether the sounds have a causal connection put the [gamer] in unusual insecure spot that might well build a more intense experience” (Conclusion section, para. 2), which has the effect of augmenting the level of fear in the player. As a unit, the techniques which aim at creating ambiguity between sound generators are based on the different circuits a sound can perform between the on-screen, the off-screen and the extra-diegetic. Indeed, it is by regularly making sounds pass from the on-screen (which allows the player to identify the cause of the sound) to the off-screen (where the sound serves as a forewarning of a threat) to the extra-diegetic (where sound simulates the presence of a threat), that videoludic sound manages to condition the gamer to be wary of everything he hears.
Fear and Context
Of course, fear will not only be induced by the morphological nature of a sound, by its fixed relation with its cause or the constructions of strategies. Fear, horror, and terror mostly depend on the context in which the sound is heard. At this level, many parameters will influence the perception the gamer will have of a sound: the spatial configuration, the general difficulty of the game, the number of enemies, the available resources, the available time and so forth. The global situation related to the perception of a sound will have a determining impact on the attitude a gamer will adopt towards this sound. A videoludic design favouring such game mechanics will therefore be an accomplice to the sound strategies.
CONCLUSION
In an attempt to scare their gamers, horror computer games utilise different strategies of mise en scène. Testament to the dialogue between the
production and reception of the games, these strategies, to be efficient, must play with the gamer’s expectations (regarding the reading and listening constraints imposed by the genre and paratext) and exploit the cognitive schemes that help them to classify the information they receive during their gameplay sessions. In this line of thought, the games must create situations that will generate negative emotions such as fear, horror, and terror. As only the gamer gets access to these emotions, I privileged an approach oriented on the reception of sound in a gameplay situation rather than a mere analysis of technical data. It is consequently with a terminology that does not reference directly the game code or algorithm, but instead focuses on the gamer’s mental reproduction of the videoludic universe, that I attempted to explain the importance of sound in the development of horror computer game strategies.
The gamer’s first objective being to ensure the survival of their player character, their tasks mainly revolve around detecting all the intrusions that might become hazardous for their character. In these circumstances, gamers must structure the sounds they hear and extract from them all the information they need to properly respond to a given situation. This cognitive process has been broadly presented with the help of Arsenault and Perron’s model (Figure 1). More precisely, the gamer must determine the origin and the cause of the sounds. To do so, they must first determine if a sound is generated by an event present within the videoludic world or overhanging this world. The gamer must then refine this categorisation to establish more precisely what, between their actions, the enemies, the game environment, and the game system, is the generator of the sound. 
At the same time, they must pay attention to the affordances (their functions) of the sounds which might communicate information about the space, the time, the enemies, and the events occurring in the game environment. The gamer must then evaluate which affordance must be prioritised according to the circumstances.
To feel safe, the gamer must be able to quickly find answers to their questions. To arouse fear, horror games block this process. While the morphologic nature of a sound is sometimes enough to induce a strong feeling of discomfort, horror computer games mostly rely on sound strategies to reach their goal. From the startle effects to the creation of ambiguity between the sound generators, the games trick the gamer’s listening by limiting the information the sounds carry. Plunged into a universe of “un-knowledge” (Kromand, 2008), the gamer can only be scared by their gameplay experience. To be really effective, the sound strategies must be part of a whole and integrated into a global staging of fear, which also depends on the relationships between the sound and images, the gameplay, and the game’s narrative. In the end, it is the pressure applied by the genre, and the deconstruction of the structure and the functions of sound by the different in-game situations, that will determine the true impact of the sound strategies on the gamer.
REFERENCES
Alone in the dark. [Computer game]. (1992). Infogrames (Developer). Villeurbanne: Infogrames.
Alone in the dark: Inferno. [Computer game]. (2008). Eden Games S.A.S. (Developer). New York: Atari.
Alone in the dark: The new nightmare. [Computer game]. (2001). DarkWorks (Developer). Villeurbanne: Infogrames.
Altman, R. (1992). General introduction: Cinema as event. In Altman, R. (Ed.), Sound theory, sound practice (pp. 1–14). New York: Routledge.
Arsenault, D., & Perron, B. (2009). In the frame of the magic cycle: The circle(s) of gameplay. In Perron, B., & Wolf, M. J. P. (Eds.), The video game theory reader 2 (pp. 109–132). New York: Routledge.
Arsenault, D., & Picard, M. (2008). Le jeu vidéo entre dépendance et plaisir immersif: les trois formes d’immersion vidéoludique. Proceedings of HomoLudens: Le jeu vidéo: un phénomène social massivement pratiqué (pp. 1-16). Retrieved from http://www.homoludens.uqam.ca/index.php?option=com_content&task=view&id=55&Itemid=63.
Boillat, A. (2009). La «diégèse» dans son acception filmologique. Origine, postérité et productivité d’un concept. Cinémas Journal of Film Studies, 19(2-3), 217–245.
Bordwell, D. (1986). Narration in fiction film. New York: Routledge.
Carr, D. (2003). Play dead: Genre and affect in Silent Hill and Planescape Torment. Game Studies, 3(1). Retrieved from http://www.gamestudies.org/0301/carr/
Chion, M. (1983). Guide des objets sonores: Pierre Schaeffer et la recherche musicale. Paris: Buchet/Chastel.
Chion, M. (1990). L’Audio-vision. Paris: Nathan.
Chion, M. (2003). Un art sonore, le cinéma: histoire, esthétique, poétique. Paris: Cahiers du Cinéma.
Collins, K. (2008). Game sound: An introduction to the history, theory, and practice of video game music and sound design. Cambridge, MA: MIT Press.
Dead space. [Computer game]. (2008). EA Redwood Shores (Developer). Redwood City: Electronic Arts.
Dektela, R., & Sical, W. (2003). Survival horror: Un genre nouveau. Horror Games Magazine, 1(1), 13–16.
Ekman, I., & Lankoski, P. (2009). Hair-raising entertainment: Emotions, sound, and structure in Silent Hill 2 and Fatal Frame. In Perron, B. (Ed.), Horror video games: Essays on the fusion of fear and play (pp. 181–199). Jefferson, NC: McFarland.
Fatal frame. [Computer game]. (2002). Tecmo (Developer). Torrance: Tecmo.
Friday the 13th. [Computer game]. (1989). Pack-In-Video (Developer). New York: LJN.
Grimshaw, M. (2008). The acoustic ecology of the first person shooter: The player experience of sound in the first-person shooter computer game. Saarbrücken, Germany: VDM Verlag Dr. Müller.
Grimshaw, M. (2009). The audio uncanny valley: Sound, fear and the horror game. In Proceedings of Audio Mostly: 4th Conference on Interaction with Sound. Retrieved from http://digitalcommons.bolton.ac.uk/cgi/viewcontent.cgi?article=1008&context=gcct_conferencepr.
Halloween. [Computer game]. (1983). Video Software Specialist (Developer). Los Angeles: Wizard Video Games.
Haunted house. [Computer game]. (1981). Atari (Developer). Sunnyvale: Atari.
Huiberts, S., & van Tol, R. (2008). IEZA: A framework for game audio. Gamasutra. Retrieved from http://www.gamasutra.com/view/feature/3509/ieza_a_framework_for_game_audio.php.
Jauss, H. R. (1982). Toward an aesthetic of reception. Minneapolis, MN: University of Minnesota Press.
Jørgensen, K. (2006). On the functional aspects of computer game audio. In Proceedings of Audio Mostly – A Conference on Sound in Games (pp. 48-52). Retrieved from http://www.tii.se/sonic_prev/images/stories/amc06/amc_proceedings_low.pdf.
Jørgensen, K. (2008). Audio and gameplay: An analysis of PvP battlegrounds in World of Warcraft. Game Studies, 8(2). Retrieved from http://gamestudies.org/0802/articles/jorgensen.
Jørgensen, K. (2011). Time for new terminology? Diegetic and non-diegetic sounds in computer games revisited. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Juul, J. (2005). Half-real: Video games between real rules and fictional worlds. Cambridge, MA: MIT Press. Kromand, D. (2008). Sound and the diegesis in survival-horror games. In Proceedings of Audio Mostly 2008 the 3rd Conference on Interaction with Sound (pp. 16-19). Retrieved from http:// www.audiomostly.com/images/stories/ proceeding08/proceedings_am08_low.pdf. Left 4 dead. [Computer game]. (2008). Turtle Rock Studios (Developer). Kirkland: Valve Software. Manovich, L. (2001). The language of new media. Cambridge, MA: MIT Press. Murray, J. (1997). Hamlet on the holodeck: The future of narrative in cyberspace. New York: The Free Press. Napolitano, J. (2008). Dead Space sound design: In space no one can hear intern screams. They are dead. (Interview). Original Sound Version. Retrieved from http://www.originalsoundversion. com/?p=693. Neale, S. (2000). Genre and Hollywood. New York: Routledge. Odin, R. (2000). De la fiction. Bruxelle: De Boeck. Perron, B. (2004). Sign of a threat: The effects of warning systems in survival horror games. In . Proceedings of COSIGN, 2004, 132–141. Retrieved from http://www.cosignconference. org/ downloads/papers/perron_cosign_2004.pdf. Perron, B. (2006). Silent hill: Il motore del terrore. Milan: Costa & Nolan. Resident evil 3: Nemesis. [Computer game]. (1999). Capcom (Developer). Sunnyvale: Capcom USA. Resident evil 4. [Computer game]. (2004). Capcom Production Studio 4 (Developer). Sunnyvale: Capcom USA.
210
Resident evil 5. [Computer game]. (2009). Capcom Production Studio 4 (Developer). Sunnyvale: Capcom USA. Cardinal, S. (1994). Occurrences sonores et espace filmique. Unpublished master’s thesis. University of Montréal, Montréal. Resident evil. [Computer game]. (1996). Capcom (Developer). Sunnyvale: Capcom USA. Resident evil. [Computer game]. (2002). Capcom (Developer). Sunnyvale: Capcom USA. Roux-Girard, G. (2009). Plunged alone into darkness: Evolution in the staging of fear in the Alone in the Dark series . In Perron, B. (Ed.), Horror video games: Essays on the fusion of fear and play (pp. 145–167). Jefferson, NC: McFarland. Silent hill 2. [Computer game]. (2001). KCET (Developer). Redwood City: Konami of America. Silent hill 3. [Computer game]. (2003). KCET (Developer). Redwood City: Konami of America. Silent hill. [Computer game]. (1999). KCEK (Developer). Redwood City: Konami of America. Stockburger, A. (2003). The game environment from an auditive perspective. In Proceedings of Level Up, DiGRA 2003. Retrieved from http:// www.stockburger.co.uk/research/pdf/ AUDIOstockburger.pdf. Sweethome. [Computer game]. (1989). Capcom (Developer). Osaka: Capcom. Taylor, L. (2005). Toward a spatial practice in video games. Gamology.Retrieved from http:// www.gamology.org/node/809. Whalen, Z. (2004). Play along: An approach to videogame music. Game Studies, 4(1). Retrieved from http://www.gamestudies.org/0401/whalen/. Wolf, M. J. P. (2003). Abstraction in the video game . In Perron, B., & Wolf, M. J. P. (Eds.), The video game theory reader (pp. 47–65). New York: Routledge.
Listening to Fear
Worldof Warcraft. [Computer game]. (2004). Vivendi (Developer). Irvine: Blizzard.
KEY TERMS AND DEFINITIONS
Allure: The amplitude or frequency modulation of a sound.
Comprendre: According to Schaeffer, comprendre means grasping a meaning, or values, by treating the sound as a sign that refers to this meaning as a function of a language or code.
Dynamic Profile: The temporal evolution of a sound’s energy.
Écouter: According to Schaeffer, écouter is listening to someone or something and, through the intermediary of sound, aiming to identify the source, the event, or the cause; it treats the sound as a sign of this source or event.
Entendre: According to Schaeffer, entendre, in keeping with its etymology, means showing an intention to listen [écouter], choosing from what we hear [ouïr] what particularly interests us, thus “determining” what we hear.
Grain: The microstructure of sound matter, such as the rubbing of a bow.
Mass Profile: The evolution in the mass of a sound, for example from pitched to complex.
Mise En Scène: The organisation of the different elements that define the staging of a scene or, in the case that interests us, the simulation of a gameplay sequence.
Ouïr: According to Schaeffer, ouïr is to perceive by the ear, to be struck by sounds; it is the crudest, most elementary level of perception. We thus “hear”, passively, many things that we are trying neither to listen to nor to understand.
Videoludic: An adjective relating to videogames. The use of this term opens a door to the use of sonoludic as an adjective for audio-only games or computer games in which gameplay mechanics are mostly based on sound.
ENDNOTES
1. It must be mentioned that this chapter does not wish to theorize the perhaps ill-suited notion of videoludic genres (a fertile field of computer game research that should, in coming years, generate quite a debate) but wishes, rather, to use it as a tool to better understand how gamers structure their gameplay session in survival horror games.
2. For space reasons, I chose to limit my analysis of these specific factors. Just keep in mind that the industry and the technology play a great part in the final rendering of the games.
3. Note that the former definition is largely associated with reception issues while the latter refers to the production aspects of the games.
4. Generic issues of survival horror games will therefore be approached as a “constraint of listening” from which the gamer will organise and evaluate the role of sound in a given context.
5. Indeed, while playing a game, the gamer never has access to this code. As Arsenault and Perron (2009) explained, the gamer “only witnesses the [...] result of the computer’s response to his action. He does not, per se, discover the game’s algorithm which remains encoded, hidden and multifaceted” which means that “the notion that a gamer’s experience and a computer program directly overlap is a mistake” (p. 110). While this statement upholds the approach of this paper, it also calls for a use of terminology that can reflect a game audio structure with accuracy and that can be applied directly to a gameplay situation.
6. I find it necessary to make this distinction because the notion of diegesis, which is now often broadly defined as “the fictional world of the story” (Bordwell, 1986), might be questionable as it sometimes seems to borrow too much from narrative theory. Étienne Souriau (n.d.), in his original definition of the term, conceptualised the “diégèse” as a “‘world’ constructed by representation” (Boillat, 2009, p. 223, freely translated) and which, as can be deduced, is not necessarily specific to narrative theory. Following Souriau’s line of thought, “the diegetic level is characterized not only by ‘everything we take into consideration as being represented by the film’ but also by ‘the type of reality supposed by the signification of the film’” (cited in Boillat, 2009, p. 222, freely translated). According to Boillat (2009), Souriau refined this definition by assimilating the “diégèse” to “all that belongs, ‘in the intelligibility’ [...] to the story being told, to the world supposed or proposed by the fiction of the film” (Boillat, p. 222, freely translated), this “all” making reference to three very important constituents: time, space, and the character. As Boillat also highlights, this second part of the definition is essential to the concept so as to prevent the “reducing [of] the ‘diégèse’, as it was often the case [...] to only the ‘recounted story’” (p. 222, freely translated). However, in his book De la Fiction, French semio-pragmatist Roger Odin makes a clarification regarding the dichotomy between the story and the diégèse. As he explained, the “diégèse” “cannot be mixed up with the story” but “provides the descriptive elements the story needs to manifest itself” (cited in Boillat, 2009, p. 234, freely translated). When applying the concept of diégèse to videogames, one must acknowledge that it does not function according to the requirements of fictional films and a pure “fictionalisation process” (Odin, 2000). The reconstruction of the diegetic stage works differently, based partly on a process of “systemic immersion” (Arsenault & Picard, 2008), allowing for more levels of communication between the gamer’s world and the gameworld. On these premises, whether or not certain sounds generated within the “diégèse” seem to address an instance outside it does not hold much importance for the construction and integrity of the “diégèse”.
7. I personally prefer to use the adjective extradiegetic instead of non-diegetic because I believe that, for example, survival horror games’ music is tightly linked to the events that are taking place in the diegetic world.
8. In Hamlet on the Holodeck, Janet Murray (1997) defines agency as “the satisfying power to take meaningful action and see the results of our decisions and choices” (p. 126).
9. For example, Jørgensen’s (2006; 2008) response functions, even though they play an important role in the actual gameplay of survival horror games, are not as important to the construction of the games’ strategies. For this reason, they will be left out of this chapter.
10. For more information on sound function, see Grimshaw (2008), Jørgensen (2006, 2008), and Collins (2008).
11. I think “player character state” would be more appropriate as the gamers themselves remain in their living room.
12. Only available in Japan.
13. This allows for the differentiation between horror computer games, which are a broader category of the videoludic horror genre, and survival horror games, which can be referred to as games that maximize the elements of a horrific mise en scène.
14. Following William H. Rockett’s line of thought.
Chapter 11
Uncanny Speech
Angela Tinwell, University of Bolton, UK
Mark Grimshaw, University of Bolton, UK
Andrew Williams, University of Bolton, UK
ABSTRACT
With the increasing sophistication of realism for human-like characters within computer games, this chapter investigates player perception of audio-visual speech for virtual characters in relation to the Uncanny Valley. Building on the findings from both empirical studies and a literature survey, it puts forward a conceptual framework for the uncanny and speech that includes qualities of speech sound, lip-sync, human-likeness of voice, and facial expression. A cross-modal mismatch between the fidelity of speech and that of the image can increase uncanniness, so game developers should give as much attention to speech sound qualities as to aesthetic visual qualities in order to control how uncanny a character is perceived to be.
DOI: 10.4018/978-1-61692-828-5.ch011
INTRODUCTION
As technological advancements allow for the representation of high fidelity, realistic, human-like characters within computer games, aspects of a character’s appearance and behaviour are being associated with the Uncanny Valley phenomenon. (A definition of the Uncanny Valley is provided in the first section of this chapter.) It seems that one of the main factors contributing to a character being regarded as lifeless as opposed to lifelike is the character’s speech. In 2006, Quantic Dream revealed a tech demo (The Casting) for the computer game Heavy Rain (2006), in which the main character, Mary Smith, evoked a somewhat negative response from the audience (Gouskos, 2006). Criticism was made of the uncanny nature of Mary Smith’s speech, in that it sounded strange and out of context with the facial expression and emotion portrayed by the character. A closer inspection of the video showed errors in the sound recording (disparities between the acoustics and the volume and materials of the room, with excessive plosives contradicting the distant camera and microphone). In addition, a lack of correct pitch and intonation for speech and a lack of synchronization of speech with lip movement reduced the overall believability of the character (Tinwell & Grimshaw, 2010). A mismatch between the emotion conveyed by Mary Smith’s voice and her gestures and posture exacerbated how unnatural and odd the character was perceived to be. MacDorman (quoted in Gouskos, 2006) observed that a perceived asynchrony of lip movement with speech was one of the factors that people found disturbing about Mary Smith:
“In addition, there is sometimes a lack of synchronization with her speech and lip movements, which is very disturbing to people. People ‘hear’ with their eyes as well as their ears. By this, I mean that if you play an identical sound while looking at a person’s lips, the lip movements can cause you to hear the sound differently.”
Since Mary Smith was revealed in 2006, increasing technological sophistication for computer games has allowed for heightened realism of human-like characters. Cinematic animation is achieved not only for cut scenes and trailers containing full motion video (FMV) but also for animation during in-game play. One example is the phoneme extractor and facial expression tool Faceposer, designed by Valve for titles such as Left 4 Dead (2008) and Half Life 2 (2008). However, it would seem that speech, as a factor integral to the uncanny phenomenon, is often overlooked when compared to the aesthetic visual qualities or behaviour of a human-like character. So far there have been limited studies to ascertain which factors contribute to the uncanny for virtual characters. In response to the hearsay in mass media raised by characters such as Mary Smith, Tinwell and Grimshaw (2010) conducted a study to investigate how the cross-modality of image and sound might exaggerate the uncanny. This study is referred to throughout the chapter as the Uncanny Modality (UM) study, and the results discussed come from it unless they are attributed to another study.
Prior to this, much of the work on the uncanny had been visually-based, excluding sound as a factor. As a way towards building a conceptual framework for the uncanny and virtual characters in immersive 3D environments, this chapter defines how characteristics of a character’s speech may exaggerate the uncanny by considering aspects such as synchronization of audio and video streams, articulation, and qualities of speech.
The first section provides an exposition of the Uncanny Valley, describing how the theory came about, previous investigation into the theory, and potential limitations of the theory in relation to virtual characters. Previous authors (such as Bailenson et al., 2005; Brenton, Gillies, Ballin, & Chatting, 2005; and Vinayagamoorthy, Steed, & Slater, 2005) have suggested that uncanniness is increased when the behavioural fidelity of a realistic, human-like character does not match that character’s realistic, human-like appearance. The second section discusses how a cross-modal mismatch between a character’s appearance and speech may exaggerate the uncanny: for instance, whether or not speech is perceived as belonging to a character, based on that character’s appearance. The third section discusses how particular qualities of speech, such as slowness of speech, intonation and pitch, and how monotone the voice sounds, may influence perceived uncanniness, and how such qualities might work to the advantage of those characters intended to elicit an eerie sensation.
The results from the UM study (Tinwell & Grimshaw, 2010) revealed a strong relationship between how strange a character is perceived to be and the lack of synchronization of speech and lip movement. (Characters rated as close to perfect synchronization for lip movement and speech were perceived as less strange than those with disparities in synchronization.) The fourth section reviews the findings from this study and also puts forward future experiments that may help to define acceptable levels of asynchrony for computer games where uncanniness is not desired. For figures onscreen, an over-exaggerated pronunciation of particular words can make the figure appear uncanny to the viewer, as the figure seems absurd or comical (Spadoni, 2000). The fifth section considers how the manner of articulation of speech may influence the uncanny by examining the visual representation (viseme) for each phoneme within the choreography tool Faceposer (Valve Corporation, 2008).
A summary is presented in the final section that defines the outcomes from this inquiry into how speech influences the uncanny for realistic, human-like virtual characters, as a way towards building a conceptual framework for the uncanny. It is intended that this framework is relevant not only to computer game characters but also to characters within a wider context of user interfaces. Examples include virtual conversational agents used within therapeutic applications to interact with autistic children and aid the development of communication skills, and virtual conversational agents used to deliver learning material to students within e-learning applications.
THE UNCANNY VALLEY
The subject of the uncanny was first introduced in contemporary thought by Jentsch (1906) in an essay entitled On the Psychology of the Uncanny. Jentsch described the uncanny as a mental state where one cannot distinguish between what is real or unreal and which objects are alive or dead. In 1919, to establish what caused certain objects to be construed as frightening or uncanny, Sigmund Freud made reference to Jentsch’s essay as a way to describe the feeling caused when one cannot detect if an object is animate or inanimate upon encountering objects such as “waxwork figures, ingeniously constructed dolls and automata” (p. 226). Freud characterized the uncanny as similar to the notion of a doppelganger; the body replica being at first an assurance against death, then the more sinister reminder of death’s omen, “a ghastly harbinger of death” (p. 235).
Building on previous depictions of the uncanny, the roboticist Masahiro Mori (1970, as translated by MacDorman & Minato, 2005) observed that a robot continued to be perceived as more familiar and pleasing to a viewer as the robot’s appearance became more human-like. However, the robot evoked a more negative response once its degree of human-likeness reached a stage at which it was close to being human, but not fully. Mori plotted a perpendicular slope climbing as the variables for perceived human-likeness and familiarity increased, until a point was reached where the robot was regarded as more strange than familiar (see Figure 1). At this point (about 80-85% human-likeness), due to subtle deviations from the human norm and the resounding negative associations with the robot, Mori drew a valley-shaped dip. A real human was placed on the other side, escaping the valley. Mori gave examples of objects such as zombies, corpses and lifelike prosthetic hands that lie within the valley. He also predicted that the Uncanny Valley would be amplified with movement as opposed to still images of a robot.
Mori recommended that robot designers avoid designing complete androids and instead develop humanoid robots with human-like traits, aiming for the first valley peak and not the second, which would risk a fall into the Uncanny Valley. As computer game designers working in particular genres continue the pursuit of realism as a way to improve player experience and immersion, they have the second peak as a goal in order to achieve believably realistic, human-like characters (Ashcraft, 2008; Plantec, 2008). To reach this goal and to assess whether overcoming the Uncanny Valley is an achievable feat, further investigation and analysis of the factors that may exaggerate the uncanny is required.
Figure 1. A diagram to demonstrate Mori’s plot of perceived familiarity against human-likeness as the Uncanny Valley (taken from a translation by MacDorman and Minato of Mori’s ‘The Uncanny Valley’)
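The valley shape in Figure 1 can be made concrete with a small sketch. The Python below is an illustration only: the function name `find_valleys` and the data points are hypothetical and invented for this example; they are not Mori's values or results from any study cited in this chapter. It treats a valley as a point whose familiarity rating is lower than that of both of its neighbours once the points are ordered by human-likeness.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (human-likeness in %, perceived familiarity)


def find_valleys(points: List[Point]) -> List[Point]:
    """Return the points whose familiarity is lower than both neighbours.

    Points are first ordered by human-likeness, so a returned point is a
    local minimum of familiarity: the bottom of a valley-shaped dip.
    """
    ordered = sorted(points)
    valleys = []
    # Slide a window of three consecutive points along the ordered curve.
    for previous, current, following in zip(ordered, ordered[1:], ordered[2:]):
        if current[1] < previous[1] and current[1] < following[1]:
            valleys.append(current)
    return valleys


# Hypothetical values, invented purely to illustrate a single dip
# close to the human end of the scale.
illustrative_curve = [
    (0, 0.0), (25, 0.3), (50, 0.6), (70, 0.8), (85, -0.5), (100, 1.0),
]

print(find_valleys(illustrative_curve))  # [(85, -0.5)]
```

Applied to ratings gathered in an experiment, the same check would report every dip in the plot, not only a single one, which makes it a convenient way to compare an observed plot with the single valley that Mori drew.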
Previous Investigation into the Uncanny Valley
Since Mori first proposed the Uncanny Valley over thirty years ago, the increasing realism possible for virtual characters and androids has sparked a renewed interest in the phenomenon (Green, MacDorman, Ho, & Vasudevan, 2008; Pollick, in press; Steckenfinger & Ghazanfar, 2009). However, there have been few empirical studies conducted to support the claims of uncanny virtual characters and androids evident within new media (Bartneck, Kanda, Ishiguro, & Hagita, 2009; MacDorman & Ishiguro, 2006; Pollick, in press; Steckenfinger & Ghazanfar, 2009).
Still images of both virtual characters and robots have been used for experiments investigating the Uncanny Valley. Design guidelines have been authored to help realistic, human-like characters escape from the valley (for example, Green et al., 2008; MacDorman, Green, Ho, & Koch, 2009; Schneider, Wang & Yang, 2007; Seyama & Nagayama, 2007). MacDorman et al. focused on how facial proportions, skin texture, and levels of detail affect the perceived eeriness, human-likeness, and attractiveness of virtual characters. Schneider et al. investigated the relationship between human-like appearance and attraction, with the results indicating that the safest combination for a character designer seems to be a clearly non-human appearance with the ability to emote like a human. Hanson (2006) conducted an experiment using still images of robots across a spectrum of human-likeness. An image of a human was morphed to an android on one half of the spectrum and then the android to a mechanical-looking, humanoid robot on the other half. The results depicted an uncanny region between the mechanical-looking, humanoid robot and the android. In a second experiment, Hanson found that it was possible to remove the uncanny region within the same plot, where it had previously existed, by changing the appearance of the android’s features to a more “cartoonish” and friendly appearance.
However, the results from these experiments provide only a somewhat limited interpretation of perceived uncanniness, based on inert (unresponsive) still images. Most characters used in animation and computer games are not stationary, with motion, timing and facial animation being the main factors contributing to the Uncanny Valley (Richards, 2008; Weschler, 2002). For realistic androids, behaviour that is natural and appropriate when engaging with humans, referred to as “contingent interaction” by Ho, MacDorman, and Pramono (2008, p. 170), is a key factor in assessing a human’s response to an android (Bartneck et al., 2009; Kanda, Hirano, Eaton, & Ishiguro, 2004). Previous authors (such as Green et al., 2008; Hanson, 2006; MacDorman et al., 2009; Schneider et al., 2007) state that the conclusions drawn from their experiments using still images might have been different had movement (and sound) been included as a factor.
The perception of the uncanny does not always have a negative impact on the viewer (MacDorman, 2006). The principles of the uncanny theory can work to the advantage of engineers when designing robots with the purpose of being unnerving within an appropriate setting and context. Similarly, the uncanny may help in the success of the horror game genre for zombie-type characters. Building on these findings, Tinwell and Grimshaw (2010) conducted the UM study, using video clips with sound, to investigate how the uncanny might enhance the fear factor for horror games. The results showed that combined factors such as appearance and sound can work together to exaggerate the uncanny for virtual characters. They suggested not only that a lack of lip/vocalization synchronization reduced how familiar a character was perceived to be, but also that perceived familiarity was reduced by a perceived lack of human-likeness in a character’s voice and facial expression and by doubt as to whether the voice actually belonged to the character.
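One way to quantify a relationship of this kind, for example between rated lip synchronization and rated familiarity, is a correlation coefficient. The sketch below is an assumption for illustration: the statistical method actually used in the UM study is not described here, the function `pearson_r` is a plain textbook Pearson correlation written for this example, and the two rating lists are hypothetical numbers on an assumed 1 to 9 scale, not data from that study.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long series with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations around each mean.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a series is constant")
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical mean ratings per character (one value per video clip),
# invented for illustration only.
hypothetical_sync = [2.1, 3.4, 5.0, 6.8, 8.2]
hypothetical_familiarity = [2.9, 3.1, 5.5, 6.0, 7.7]

r = pearson_r(hypothetical_sync, hypothetical_familiarity)
print(f"r = {r:.2f}")
```

A value near +1 would indicate that characters rated as better synchronized also tend to be rated as more familiar; a value near 0 would indicate no linear relationship. Because rating scales are ordinal, a rank-based measure could be a reasonable alternative.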
Limitations of Mori’s Theory
Recent studies demonstrate weaknesses within the Uncanny Valley theory and suggest it may be more complex than the simplistic valley shape that Mori plotted in his original diagram (see Figure 1). Various factors (including speech) can influence how uncanny an object is perceived to be (Bartneck et al., 2009; Ho et al., 2004; Minato, Shimada, Ishiguro, & Itakura, 2004; Tinwell & Grimshaw, 2009). Attempts to reproduce Mori’s Uncanny Valley shape have not confirmed the two-dimensional construct that Mori envisaged. The results from experiments conducted using cross-modal factors such as motion and sound imply that the uncanny phenomenon is unlikely to be reducible to the two factors of perceived familiarity and human-likeness, and is instead better described by a multi-dimensional model (see Figure 2).
Figure 2. The Uncanny Wall (Tinwell & Grimshaw, 2009)
When ratings for perceived familiarity were plotted against human-likeness, the results from Tinwell and Grimshaw’s experiment, which used 100 participants and 15 videos ranging from humanoid to human with character vocalization, depicted more than one valley shape. The plot is more complex than Mori’s smooth curve and the valley shapes are less steep than Mori’s perpendicular climb. The most significant valley occurs between the humanoid character Mario, on the left, and the stylized, human-like Lara Croft, on the right. The nadir for this valley shape is positioned at about 50-55% human-likeness, which is lower than Mori’s original prediction of 80-85% human-likeness.
Results from studies using robots with motion and speech are also inconsistent with Mori’s Uncanny Valley. MacDorman (2006) plotted ratings for perceived familiarity against human-likeness for an experiment using videos of robots from mechanical to human-like, including some stimuli with speech. The results showed no significant valley shape in keeping with the depth and gradient of Mori’s plot, and showed that robots rated with the same degree of human-likeness can have different ratings for familiarity. Bartneck et al. (2009) compared a robotic copy of a human with that human under two conditions, movement (with motion and speech) and still. Despite a significant difference in perceived human-likeness between the human and the android, there was no significant difference in perceived likeability between the two. These results imply that movement may not be the only factor to influence the uncanny. Further investigation is required to assess how speech may contribute to a more multi-dimensional model to measure the uncanny.
Uncertainty exists as to whether the meaning of Mori’s original concept may have been “lost in translation” (Bartneck et al., 2009, p. 270). The word that Mori used in the title for the Uncanny Valley is bukimi, which translates from Japanese as “weird, ominous, or eerie”. In English, “synonyms of uncanny include unfamiliar, eerie, strange, bizarre, abnormal, alien, creepy, spine tingling, inducing goose bumps, freakish, ghastly and horrible” (MacDorman & Ishiguro, 2006, p. 312), while Freud used the word unheimlich to define the uncanny. Further confusing the issue, the root heimlich has two meanings, viz. familiar or agreeable, and that which is concealed and should be kept from sight. Freud discussed both meanings in his 1919 essay and they are not necessarily mutually exclusive, as we show below.
However, despite a generic understanding of the word that Mori used, previous authors have questioned the appropriateness of the term shinwa-kan (translated as familiarity), which Mori used in his original paper as a variable to measure and describe uncanniness. It is an uncommon word within Japanese culture and has no direct English equivalent. The word familiarity stands for the opposite of unfamiliarity (one of the synonyms for bukimi), yet the word familiarity may be open to misinterpretation. Whilst strange is a typical term for describing the unfamiliar, familiarity might be interpreted with a variety of meanings, including how well-known an object appears: for example, a well-known character in popular culture or an android replica of a famous person. Bartneck et al. (2009) proposed that, with no direct translation, shinwa-kan could be treated as a “technical term” in its own right; however, this may cause problems when comparing the results from one experiment to another where the more generic translation “familiarity” is used as the dependent variable (p. 271). Other words such as likeability (Bartneck et al., 2009) or unstrange (the opposite of strange) may be closer to Mori’s original intention. Nevertheless, experiments into the uncanny may be more robust if a standard word were used as the dependent variable to measure and describe perceived uncanniness: that word has yet to be agreed upon.
Conflicting views exist as to whether it is actually possible to overcome the Uncanny Valley.
One theory put forward is that objects may appear less uncanny over time as one grows used to a particular object. Brenton et al. (2005) give the example of the life-like sculpture The Jogger by Duane Hanson: the sculpture will appear “less uncanny the second time that it is viewed because you are expecting it and have pre-classified it as a dead object”. The effect of habituation may also apply to those with regular exposure to realistic, human-like virtual characters. 3D modellers working with this type of character, or gamers with an advanced level of gaming experience, may be less able to detect flaws within a particular character because they have grown accustomed to the appearance and behaviour of that character by interacting with it on a regular basis (Brenton et al., 2005).
Recent empirical evidence goes against this theory. The results from a study by Tinwell and Grimshaw (2009) showed that the level of experience in both playing computer games and using 3D modelling software made little difference in detecting uncanniness. (Judgements of perceived familiarity and human-likeness by those with an advanced level of experience did not differ significantly from those of participants with less or no experience.) Tinwell and Grimshaw suggest it may never be possible to overcome the Uncanny Valley, as a viewer’s discernment for detecting subtle deviations from the human norm keeps pace with developments in technology for creating realism. With a lack of empirical evidence to support the notion of an Uncanny Valley, the notion of an Uncanny Wall may be more appropriate (see Figure 2). Viewers who may at first have been “wowed” by the apparent realism of characters such as Quantic Dream’s Mary Smith (2006) or characters in animation such as Beowulf (Zemeckis, 2007) or The Polar Express (Zemeckis, 2004) soon developed the skills to detect discrepancies in such characters’ appearance and behaviour. Indeed, as soon as the next technological breakthrough in achieving realism is released, a viewer may be reminded of the flaws in a character that at first did not seem uncanny.
The meaning of uncanny as used in the Uncanny Wall hypothesis is an exposition of the first Freudian sense of heimlich/unheimlich described above. In addition, the undesired unmasking of the technological processes used in the production of a character, and the perception of those processes as flaws in the presentation of that character, allows us simultaneously and without contradiction to use the second meaning of heimlich: that which should remain out of sight. The concept of the Uncanny Wall (as opposed to the Uncanny Valley, which always holds out the hope of a successful traversal to the far side) evokes a variety of myths, legends and modern stories (Frankenstein’s monster, for example, or the Golem) in which beings created by man are condemned to forever remain pale shades of those created by gods.
Further studies would be required to provide evidence for the Uncanny Wall and to substantiate the hypothesis that the Uncanny Valley cannot be surmounted by realistic, human-like virtual characters. As soon as the next character is released and announced as having overcome the Uncanny Valley, we intend to conduct another test using the same characters as in the previous experiment. If those characters previously rated as close to escaping the valley, such as Emily (Image Metrics, 2008), are placed beneath the new character as perceived strangeness increases, our prediction may be justified. In the meantime, a conceptual guide for uncanny motion and sound in virtual characters may be beneficial in aiding computer game developers to manipulate the degree of uncanniness.
CROSS-MODAL MISMATCH
For androids, if a human-like appearance causes us to evaluate an android’s behaviour from a human standard, we are more likely to be aware of disparities from human norms (MacDorman & Ishiguro, 2006; Matsui, Minato, MacDorman, & Ishiguro, 2005; Minato et al., 2004). Ho et
al. (2008) observed that a robot is eeriest when a human-like appearance creates an expectation of a human form that nonhuman-like elements then fail to meet. Also, a mismatch in the human-likeness of different features for a robot, for example, a nonhuman-like skin texture combined with human-like hair and teeth, elicited an uncanny sensation for the viewer. With regard to virtual characters, it has been suggested that a high graphical fidelity for realistic human-like characters raises expectations for the character’s behavioural fidelity (Bailenson et al., 2005; Brenton et al., 2005; Vinayagamoorthy et al., 2005). Any discrepancies from the human norm in how a character spoke or moved would appear odd. For humanoid or anthropomorphic characters with a lower fidelity of human-likeness (for example, Mario or Sonic the Hedgehog), differences from the human norm would be more acceptable to the viewer: Expectations are lowered based on the more stylized and iconic appearance for that character. Despite seemingly strange behaviour with jerky movements or a less than human-like voice, the viewer will still develop a positive affinity with the character. Empirical evidence implies that humanoid and anthropomorphic type characters do escape the valley dip as Mori predicted, being placed before the first peak in the valley (Tinwell, 2009; Tinwell & Grimshaw, 2009).
Evidence shows that for virtual characters (and robots) a perceived mismatch between the human-likeness of a character’s voice and that character’s appearance exaggerates the uncanny. As part of the Uncanny Modality survey (Tinwell & Grimshaw, 2010), 100 participants rated how human-like the character’s voice sounded and how human-like the facial expression appeared using a scale from 1 (nonhuman-like) to 9 (very human-like). Strong relationships were identified between the uncanny and perceived human-likeness for a character’s voice and facial expression. 
The less human-like the voice sounded, the more strange the character was regarded to be. Uncanniness
also increased for a character the less human-like the facial expression appeared. Laurel (1993) suggests that to achieve harmony, there is an expectation for the sensory modalities of image and sound to have the same resolution. So that there is accord between visual appearance and behaviour for virtual characters, we put forward that the degree of fidelity of human-likeness for a character’s voice should match that character’s appearance, or otherwise risk discord for that character. To avoid the uncanny, attention should be given to the fidelity of human-likeness for a character’s voice in accordance with that character’s appearance. A high-fidelity human-like character is expected to have a human-like voice of a resolution that matches its realistic, human-like appearance. However, for mechanical-looking robots, a less human-like and more mechanical-sounding voice is preferable. The humanoid robot Robovie was intentionally given a mechanical-sounding voice so that it appeared more natural to the viewer (Kanda et al., 2004). A voice that was too human-like may have been regarded as unnatural based on the robot’s appearance, thus exaggerating the uncanny for the robot. To test the Uncanny Valley theory with virtual characters, it has been suggested that it is not necessary to include characters from computer games as the level of realism achieved from gaming environments generated in real-time is less than that achieved for animation and film (Brenton et al., 2005). Some characters created for television and film have been proclaimed as overcoming the Uncanny Valley: In 2008, Plantec hailed the character Emily as finally having done so. 
Walker, of Image Metrics, states that whilst computer games would benefit from these more realistically rendered faces, it is not yet possible to achieve the same high level of polygon counts for in-game play as achieved for television and film due to technical restrictions: “We can produce Emily-quality animation for games as well, but
it just can’t work in a real-time gaming environment” (as quoted in Ashcraft, 2008). Accordingly, for virtual characters used within computer games that are approaching levels of realism as achieved for the film industry, it may be advisable to reduce the level of human-likeness for a character’s voice to a level that is in keeping with that character’s appearance. Actors’ voices are typically used for realistic, human-like characters’ speech in computer games. Yet, if the level of fidelity for achieving human-like realism for computer games is less than that achieved for film, a less than human-like voice should be used to avoid the character being perceived as unnatural. Hug (2011) makes a similar point when discussing the similarities between indie game and animation film aesthetics. Hug describes an affinity between the sound used in animation film or cartoons and the aesthetic style of the animation: “[S]ounds that are more or less de-naturalized in a comical, playful, or surreal way, which is characterized by a subversive interpretation of sound-source associations”. He further uses the example of an explosion that occurs within the arcade game Grey Matter (McMillen, Refenes, & Baranowsky, 2008) as an intriguing case of “cartoonish” sound design: “when an abstract dot hits a flying cartoon brain, the latter ‘explodes’ with sounds of broken glass”. Although a more cartoonish style of sound is used for the explosion, the sound seems more in keeping with the stylized appearance of the object to which the sound belongs. The visceral sounds of the impact are still evident despite the more simplistic nature of the sound. The acoustics appear more natural as the level of detail appears to match the stylized aestheticism of the game’s environment. 
Of course, we do not suggest that cartoon-like voices be used with characters that are approaching believable realism in computer games; however, the level of human-likeness may be subtly modified so that the perceived style of the voice sound matches the aesthetic appearance of the character. This absurd juxtaposition may be necessary to reduce the uncanny for computer game characters due to the fact that they will always be playing catch-up to the level of realism achieved for film. Refinements made to characters’ voices over a spectrum of human-likeness ranging from human-like to mechanical may perhaps help to remove the uncanny where it was previously evident. Reiter notes that recently, more attention has been given to the quality of sound in computer games to keep up with the quality of realism achieved visually for in-game play and to provide a more cinematic experience. As a method of communication, both diegetic and non-diegetic game sound enhance a game’s plausibility in that sound can “trigger emotions and provide additional information otherwise hard to convey” (Reiter, 2011). Distinctions made as to the quality of game sound are not simply due to the level of clarity, resolution, or digital output achievable for sound: “Perceived quality in game audio is not a question of audio quality alone” (Reiter, 2011). For speech, textures, emotive qualities and delivery style are attributes that contribute to the perceived quality and overall believability for a character. (Qualities of speech and the uncanny are discussed further in the following section.) Quality of speech is critical in portraying the emotive context of a character convincingly. However, with regard to the uncanny, if the perceived realism and quality of a voice go beyond the quality and realism of a character’s appearance, such a cross-modal mismatch could exaggerate the uncanny. Further experiments are required to test this theory. Building on the premise of Hanson’s (2006) experiment where the uncanny was removed from a morphed sequence of images from robot to human by making a robot’s features more “cartoonish” and friendly, similar changes could be made to the acoustics of speech for videos of realistic, human-like characters. 
Whilst the videos of characters would remain constant, the speech sound would be changed across a spectrum of human-likeness from mechanical to human-like. If our predictions are correct, characters will be perceived as more strange when the speech sounds too mechanical or too human-like in relation to the fidelity of human-likeness for a character’s appearance. A character may appear more natural and be perceived as more familiar once the fidelity of human-likeness for speech is adjusted to be regarded as matching that of a character’s appearance.
QUALITIES OF SPEECH
Bizarre qualities and textures of speech served to gratify the pleasure humans sought in frightening themselves with early horror film talkies, for example the monster in Browning’s (1931) film Dracula. Some cinematic theorists argue that the success of films such as Dracula was due to an uncanny modality that occurred during the transition from silent to sound cinema (Spadoni, 2000, p. 2). Sounds that may have been perceived as unreal or strange due to technical restrictions of sound recording and production at the time were used to the advantage of the character Dracula. For early sound film, to produce the most intelligible dialogue for the viewer, the recording process required that words were pronounced slowly, emphasizing every “syl-la-ble” (Spadoni, 2000, p. 15). However, whilst words could be easily interpreted by the viewer, this impeded delivery style made the speech sound unnatural and unreal. Speech delivery style also influenced how strange Dracula was perceived to be. In the role of Dracula, the acoustics of Bela Lugosi’s speech set the standard for what the “voice of horror” should be (Spadoni, 2000, pp. 63-70). The weird textures of Bela Lugosi’s voice were manipulated to create a greater conceptual peculiarity for the viewer, thus setting the eponymous character apart from those of other horror films. The distinctive vocal tone and pronunciation of Dracula’s speech were characteristics that critics acclaimed as the most shocking and chilling; “slow painstaking voices pronouncing each syllable at
a time like those of radio announcers filled the theatre” (p. 64). As Tinwell and Grimshaw (2010) state, paraphrasing Spadoni, the unique textures and delivery style for Dracula’s speech increased the uncanny for Dracula: Dracula’s voice, the ethereal voice of the undead, is compared to the voice of reason and materiality that is Van Helsing’s. In the former, the uncanny is marked by uneven and slow pronunciation, staggered rhythm and a foreign (that is, not English) accent and all this produces a disconnect between body and speech. Van Helsing’s speech, by contrast, is the embodiment of corporiality; authoritative, clearly enunciated and rational in its delivery and meaning.
For zombie characters in computer games, comparisons have been made with horror film talkies as to the methods used to create and modify sound to induce an ambience of fear (Brenton et al., 2005; Perron, 2004; Roux-Girard, 2011; Toprac & Abdel-Meguid, 2011). Results from the UM study by Tinwell and Grimshaw (2009) to define cross-modal influences of image and sound and the uncanny in virtual characters show that particular qualities of speech (similar to those observed for early horror talkies) can exaggerate how uncanny a virtual character is perceived to be. Thirteen video clips of one human and twelve virtual characters in different settings and engaged in different activities were presented to 100 participants. The twelve virtual characters consisted of six realistic, human-like characters: (1) the Emily Project (2008) and (2) the Warrior (2008) both by Image Metrics; (3) Mary Smith from The Casting (Quantic Dream, 2006); (4) Alex Shepherd from Silent Hill Homecoming (Konami, 2008) and two avatars (5) Louis and (6) Francis from Left 4 Dead (Valve, 2008); four zombie characters, (7) a Smoker, (8) The Infected, (9) The Tank and (10) The Witch from Left 4 Dead; (11) a stylised, human-like Chatbot character “Lillien” (Daden Ltd, 2006); (12) a realistic, human-like
zombie (Zombie 1) from the computer game Alone in the Dark (Atari Interactive, Inc, 2009) and (13) a human. Table 1 shows the median ratings for a character’s strangeness and for the speech qualities: whether the speech seemed (a) slow, (b) monotone, (c) of the wrong intonation, (d) if the speech did not appear to belong to a character, or (e) none of the above. Characters with the same median value for strangeness were grouped together and the median values for speech qualities were then calculated for those characters or groups. (Median values were used to indicate a central tendency for results, to help establish a clear overall picture of the vital relationships over multiple qualities of speech.) The results implied that slowness of speech, incorrect intonation and pitch, and how monotone the voice sounded all increased uncanniness. A strong indirect relationship was identified between individual ratings for the variables “the speech intonation sounds incorrect” and “the voice belongs to the character”. This implies that if the intonation for a character’s voice is in keeping with what the viewer may have expected, this characteristic may contribute to the overall believability for that character. The two zombies the Witch and the Tank, from the computer game Left 4 Dead (Valve, 2008), were regarded as the most uncanny with a median strangeness rating
of just 2 (see Table 1). However it seems the unintelligible hisses and snarls from the Tank were regarded as sounds that this character was likely to make based on the Tank’s appearance and how he behaved. Likewise the inhuman cries and screeches from the Witch matched her seemingly pathetic and wretched appearance. Such sounds enhanced the believability of these characters as they were in keeping with their nonhuman-like appearance. The findings from the UM study provide empirical evidence to support the claims made by MacDorman (as quoted in Gouskos, 2006) that Mary Smith’s speech was one of the main contributing factors as to why she was perceived as uncanny. Twenty percent of participants observed a lack of correct pitch and intonation for Mary Smith’s speech. This implies that the pitch and tone for her voice may not have matched the facial expression exhibited by this character. The emotive qualities of speech may have seemed either inappropriate or out of context with how this character appeared to look and behave. The facial expression may not have matched nor accurately conveyed the emotive qualities of her voice. Attributes such as these raised doubts as to whether the voice actually belonged to this character or not, thus increasing the sense of perceived eeriness for this character.
Table 1. Median ratings for speech qualities for those characters or groups with the same median strangeness value (Tinwell & Grimshaw, 2010). Note. Judgements for strangeness were made on 9-point scales (1 = very strange, 9 = very familiar).

Median Strangeness for Character or Group | Slow | Monotone | Wrong intonation | Belongs | None
The Tank, The Witch (Mdn = 2) | 10 | 9.5 | 23.5 | 56.5 | 16.5
The Infected, The Smoker, Zombie 1, Chatbot (Mdn = 3) | 24 | 21.5 | 40 | 42 | 8.5
Mary Smith (Mdn = 4) | 8 | 3 | 20 | 20 | 8
The Warrior, Alex Shepherd (Mdn = 6) | 14 | 17 | 17 | 62.5 | 7.5
Louis, Francis (Mdn = 7) | 2.5 | 3.5 | 6.5 | 79.5 | 4.5
Emily (Mdn = 8) | 2 | 0 | 2 | 87 | 6
Human (Mdn = 9) | 1 | 15 | 4 | 72 | 6
As well as speech regarded as being of the wrong pitch, speech that is delivered in a slow, monotone way increased the uncanny for both zombie characters and human-like characters not intended to contest a sense of the real. Within the UM study, the Chatbot character received a less than average rating for perceived familiarity and was grouped with three zombie characters with a median strangeness value of just three (see Table 1). The Chatbot’s voice was rated individually as being slow (75%), monotone (59%), and of an incorrect intonation (76%). The “speech” for Zombie 1, grouped with the Chatbot character with a median strangeness value of three, was also judged individually as being monotone (29%), slow (42%), and of an incorrect intonation (34%). Including such qualities of speech for the zombie may have been a conscious design decision by developers to increase the perceived eeriness for a character intended to elicit an uncanny sensation. (As mentioned above, such qualities enhanced the overall impact for the monster Dracula.) However, the crippled speech style for the Chatbot appeared unnatural and unreal. Such qualities for this character’s speech were factors that viewers found most annoying and irritating, exaggerating the uncanny for this character when perhaps this was not intended. Our results imply that uncanniness is increased if speech is judged to be of the wrong pitch, too monotone, or slow in delivery style. Whilst such qualities can work to the advantage of antipathetic characters by increasing the fear factor, these qualities may work against empathetic characters in the role of hero or protagonist within a game. A designer may wish the player to have a positive affiliation with the protagonist character, yet the designer may unwittingly create an uncanny sensation for the player with speech qualities that sound strange to the viewer. 
Speech prerecorded in a manner that is too slow or monotone to aid clarity for post-production purposes may be judged as unnatural and should be instead recorded at an appropriate tempo. Pitch and tone of speech that do not match the facial expression or given
circumstance for a character may be regarded as out of context and confusing for a viewer. To avoid the uncanny, attention should be given to ensuring that the pitch of voice accurately depicts the given emotion for a character and, once speech has been recorded at the correct pitch, that the facial expression conveys that emotion convincingly.
LIP-SYNCHRONIZATION VOCALIZATION
The process of matching lip movement to speech is an integral factor in maintaining believability for an onscreen character (Atkinson, 2009). For first-person shooters (FPS) and other similar types of action game, there are limited periods during gameplay when attention is focused solely on a headshot of a speaking character. Close-up shots of a player’s character, comrades or antagonists are predominantly used when exchanging information during gameplay or during cinematic cut scenes and trailers. The music genre of computer games provides an outlet for musicians to promote and sell their work (Kendall, 2009; Ripken, 2009). As well as FPS games, music games can provide a challenge for developers with regard to facial animation and sound. The Beatles: Rock Band (EA Games, 2009) highlights the recent success of the merger of music and computer games that use realistic, human-like characters to represent music artists. It has been found, however, that uncanny traits can leave viewers dissatisfied with particular characters within the context of a computer game (Tinwell, 2009). With emphasis directed at a character’s mouth as the vocals are matched to the music tracks, it is important that an artist’s identity be transferred effectively within this new medium (Ripken, 2009). Factors such as asynchrony may result in a negative impact on the overall believability for such characters. This section discusses the outcomes of a lack of synchrony for lip-vocalization narration in film
and television and the corresponding implications for characters in computer games.
Lip Syncing for Television and Film
The process of a viewer accepting that sound and image occur simultaneously from one given source is referred to as synchresis (Chion, 1994) or synchrony (Anderson, 1996).1 For early sound cinema, various methods of sound recording and post production techniques were applied before a viewer no longer doubted that a voice actually belonged to a figure onscreen. A perceived lack of synchronization between image and sound has been equated with much of the uncanny sensation evoked by films within the horror genre in early sound cinema (Spadoni, 2000, pp. 58-60). Errors in synchrony evoked the uncanny for a scene in Browning’s Dracula (1931). As a figure’s lips remained still, human laughter resonated within the scene. With no given body or source, the laughter is regarded as an eerie, disembodied sound. Whilst technology allows for some improvement with cinema speakers, televisions and personal computers, most sound is still delivered through some mechanism that is physically disjunct from the onscreen image (for example, via headphones or separate speakers). Tinwell and Grimshaw (2010) note that future technologies may overcome issues with asynchrony within the broadcasting industry: “Presumably, there will be no need for such perceptual deceit once flat-panel speakers with accurate point-source technology provide simultaneously a visual display” (p. 7). For human figures in television and film, viewers are more sensitive to an asynchrony of lip movement with speech than for visual information presented with music (Vatakis & Spence, 2005). Viewers are also more sensitive to asynchrony when sound precedes video and less so when sound lags behind video (Grant et al., 2004). Grant et al. found that for continuous streams of audio-visual speech presented onscreen, detectable asynchrony occurred at 50ms when sound preceded video,
with a larger window of acceptable asynchrony, up to 220ms, when sound lagged behind video. Standards set by the television broadcasting industry require that the audio stream should not precede the video stream by more than 45ms and that the audio stream should not lag behind the video stream by more than 125ms (ITU-R, 1998). An asynchrony for speech with lip movement can lead to one misinterpreting what has been said: the McGurk Effect (1976). As a viewer, one can interpret what has been heard by what has been seen. Depending on which modality one’s attention may be drawn to for audio-visual speech (and depending on which syllable is used), the pronunciation of a visual syllable can take precedence over the auditory syllable. Conversely, a sound syllable can take precedence over the visual syllable. Alternatively, as one comprehends the visual articulatory process of speech both automatically and subconsciously, one can combine the sound and visual syllable information to create a new syllable. For example, a visual “ga” coinciding with the sound “ba” can be interpreted as a “da” sound. (This type of effect was observed by MacDorman (2006) for the character Mary Smith’s speech, who was criticized for being uncanny.) A viewer’s overall enjoyment of a television programme can be disrupted if delays occur between transmission devices for video and audio signals. To prevent confusion or irritation for the viewer, sub-titles are often preferred to dubbing of speech for foreign works (Hassanpour, 2009). Errors in the synchronization of lip movements with voice for figures onscreen (lip sync error) can result in different responses from the viewer depending upon the context within which the errors are portrayed. 
A study by Reeves and Voelker (1993) found that not only is lip sync error potentially stressful for the television viewer, but it can also lead to a dislike for a particular program and viewers evaluating the people displayed on the screen more negatively and as “less interesting, more unpleasant, less influential, more agitated, more confusing, and less successful” (p. 4). On the
contrary, lip sync error has also been deliberately used to provoke a humorous effect for the viewer where the absurd is regarded as comical as opposed to annoying. An example is the intentionally bad dubbing for characters in “Chop-Socky” movies (Tinwell & Grimshaw, 2010).
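The broadcast tolerances cited above can be expressed as a simple check. The following is a minimal Python sketch, assuming the 45ms and 125ms limits as quoted in the text from ITU-R (1998). The function names and the sign convention (a negative offset means audio precedes video) are illustrative assumptions and are not taken from any standard or game engine.

```python
# Limits as quoted in the text from ITU-R (1998), in milliseconds.
MAX_AUDIO_LEAD_MS = 45.0   # audio should not precede video by more than this
MAX_AUDIO_LAG_MS = 125.0   # audio should not lag behind video by more than this


def audio_offset_ms(audio_ts_ms: float, video_ts_ms: float) -> float:
    """Signed offset between matching audio and video events.

    Assumed convention: negative when audio precedes video,
    positive when audio lags behind video.
    """
    return audio_ts_ms - video_ts_ms


def within_broadcast_tolerance(offset_ms: float) -> bool:
    """True if the signed offset falls inside the quoted broadcast window."""
    return -MAX_AUDIO_LEAD_MS <= offset_ms <= MAX_AUDIO_LAG_MS


if __name__ == "__main__":
    # Audio heard at 1000 ms, matching lip movement shown at 1040 ms:
    # audio precedes video by 40 ms.
    offset = audio_offset_ms(1000.0, 1040.0)
    print(offset, within_broadcast_tolerance(offset))
```

The window is deliberately asymmetric, mirroring the finding that viewers are more sensitive when sound precedes video. The two constants are the only place where the broadcast figures enter, so a different window could be substituted if one were established for another medium.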
Lip Syncing for Computer Games
With increasing technological sophistication in the creation of realism in computer games, text-based communication systems have been replaced with virtual characters using actors’ voices. To create full voice-overs for characters, automated lip-syncing tools extract phoneme sounds from prerecorded lines of speech. The visual representation (viseme) for a particular sound is retrieved from a database of predetermined mouth shapes. Muscles within the mouth area for a 3D character are modified to create a particular mouth shape for each phoneme. Interpolated motion is inserted between one mouth shape and that associated with the next phoneme to maintain continuity of lip movement for words within a given sentence. For example, a specific mouth shape can be selected for the sound “sh” to be used in conjunction with other sounds within a word or line of speech. Full voice-overs for characters were generated for titles developed by Valve such as Left 4 Dead (2008) and Half Life 2 (2008) using this technique. A phoneme extractor tool within Faceposer allowed for the detection and extraction of phoneme sounds from prerecorded speech to be synchronized with a character’s lips. Whilst research has been undertaken to improve the motion quality of real-time, data-driven approaches for realistic visual speech synthesis (Cao, Faloutsos, Kohler, & Pighin, 2004), prior to the UM study (Tinwell & Grimshaw, 2010) there have been no attempts to investigate what impact lip-synchronization may have on viewer perception and the uncanny in virtual characters. Videos of 13 virtual characters ranging from humanoid to human were rated by 100 participants as to how uncanny and how synchronized speech with lip movement was perceived to be. (A full description of the stimuli used in the experiment is provided in the third section.) The results revealed a strong relationship between how uncanny a character was perceived to be and a lack of synchronization between lip movement and speech: those characters with disparities in synchronization were perceived as less familiar and more strange than those characters rated as close to perfect lip-synchronization. Synchronization problems with the recorded voice for early sound cinema heightened a viewer’s awareness that the figure was not real and was simply a manufactured artifact (Spadoni, 2000, p. 34). A viewer was reminded that figures onscreen were merely fabricated objects created within a production studio. The uncanny was increased as figures were perceived as “a reassembly of a figure” easily disassembled within a movie theatre (Spadoni, 2000, p. 19). The results from the UM study (Tinwell & Grimshaw, 2010) imply that the implications of asynchrony for speech and the uncanny for human figures within the classic horror cycle of Hollywood film also apply to virtual characters intended for computer games. The zombie characters the Witch and the Tank from the computer game Left 4 Dead (2008) received less than average scores for perceived lip-synchronization. The jerky, haphazard movement of the Witch’s lips appeared disparate from the high-pitched cries and shrieks spewed out by this character. As the Witch proceeded to attack, her presence seemed ever more overwhelming as sounds appeared to emanate from an incorporeal and uncontrollable being in a similar manner to Dracula’s laughter noted earlier. Similarly, participants seemed somewhat confused by the chaotic movement and irregular sounds generated by the Tank character, making the viewer feel panicked and uncomfortable. The stimuli for this study were presented in different settings and as different actions. 
Some were presented as talking heads, for example the
Chatbot character, whilst others moved around the screen, for example the Tank and the Witch. A further study is required to determine the actual causality of lip-synchronization as a significant contributor towards the uncanny when not associated with other factors of facial animation and sound. Thus, we intend a further experiment to test the hypothesis: Uncanniness increases with increasing perceptions of lack of synchronization between the character’s lips and the character’s sound. At present there are no standards set for acceptable levels of asynchrony for computer games as there are for television. It may well be that these acceptable levels are the same across the two media but it might equally be the case that the interactive nature of computer games and the use of different reproduction technologies and paradigms call for a different standard. For example, perhaps it is the case that current technological limitations in automated lip-syncing tools require a smaller window of acceptable asynchrony for computer games than previously established for television. We hope the future experiment noted above will also ascertain if viewers are more sensitive to an asynchrony of speech for virtual characters where the audio stream precedes video (as has been previously identified for the television broadcasting industry).
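The automated process described earlier in this section (phoneme extraction, viseme lookup, and interpolation between mouth shapes) can be sketched as follows. This is an illustrative Python sketch only and is not Faceposer’s implementation: the viseme table, the parameter names (jaw_open, lip_round), and the timing format are hypothetical, and simple linear interpolation is assumed.

```python
# Hypothetical viseme table: each phoneme maps to a mouth shape described by
# named parameters between 0.0 and 1.0. Names and values are illustrative only.
VISEMES = {
    "sh": {"jaw_open": 0.2, "lip_round": 0.8},
    "aa": {"jaw_open": 0.9, "lip_round": 0.1},
    "m": {"jaw_open": 0.0, "lip_round": 0.3},
}


def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def mouth_shape_at(track, time_s):
    """Return the interpolated mouth shape at a given time.

    track is a list of (start_time_s, phoneme) pairs sorted by time, standing
    in for the output of a phoneme extractor (assumed format).
    """
    if time_s <= track[0][0]:
        return dict(VISEMES[track[0][1]])
    for (t0, p0), (t1, p1) in zip(track, track[1:]):
        if t0 <= time_s < t1:
            t = (time_s - t0) / (t1 - t0)
            start, end = VISEMES[p0], VISEMES[p1]
            # Blend every parameter from the current viseme towards the next.
            return {name: lerp(start[name], end[name], t) for name in start}
    # Past the final phoneme: hold the last mouth shape.
    return dict(VISEMES[track[-1][1]])


if __name__ == "__main__":
    track = [(0.0, "m"), (0.2, "aa"), (0.5, "sh")]
    print(mouth_shape_at(track, 0.1))
```

A production system would typically blend more parameters and use smoother curves than a straight line, but the sketch shows why timing errors in the phoneme track translate directly into visible lip-sync error.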
ARTICULATION OF SPEECH
Hundreds of individual muscles contribute to the generation of complex facial expressions and speech. Because the mouth is one of the most complex muscular regions of the human body, and as realism for characters increases, generating realistic animation for mouth movement and speech is a challenge for designers (Cao et al., 2004; Plantec, 2007). Even though the dynamics of each of these muscles are well understood, their combined effect is very difficult to simulate precisely. Whilst motion capture allows for the recording of high fidelity facial animation and expression, this technique is mostly useful for FMV. Recorded motions are difficult to modify once transferred to a three-dimensional model and the digital representation of the mouth remains an area requiring further modification. Editing motion capture data often involves careful key-framing by a talented animator. A developer may edit individual frames of existing motion capture data for prerecorded trailers and cut scenes yet, for computer games, most visual material is generated in real-time during gameplay. For in-game play, automatic simulation of the muscles within and surrounding the mouth is necessary to match mouth movement with speech. Motion capture by itself cannot be used for automated facial animation. To create automatic visual simulation of mouth movement with speech, computer game engines require a set of visemes as the visual representation for each phoneme sound. Faceposer (Valve, 2008) uses the phoneme classes phonemes, phonemes strong, and phonemes weak with a corresponding viseme to represent each syllable within the International Phonetic Alphabet (IPA). Prerecorded speech is imported into a phoneme extractor tool that extracts the most appropriate phoneme (and corresponding viseme) for recognized syllables. Editing tools allow for the creation of new phoneme classes, or the modification of the mouth shape for an existing viseme. The UM study (Tinwell & Grimshaw, 2010) identified a strong relationship between how uncanny a character was perceived to be and a perceived exaggeration of facial expression for the mouth. The results implied that those characters perceived to have an over-exaggeration of mouth movement were regarded as more strange. Thus, uncanniness increases with increasing exaggeration of articulation of the mouth during speech. Finer adjustments to mouth shapes using tools such as Faceposer may prevent a perceived over-exaggeration of articulation of speech, yet such adjustments are time consuming for the developer. 
If no original visual footage is available for speech,
judgements made to correct mouth shapes that appear too strong or too weak are likely to be based on the subjective opinion of an individual developer. Even then, the developer is still constrained by the number of mouth and facial muscles available to modify within the 3D model, which may not include an exhaustive depiction of every single muscle used in human speech. To avoid the uncanny, working with the range of mouth shapes and facial expression that current technology allows for within tools such as Faceposer, the developer should at least avoid an articulation of speech that may appear over-exaggerated. The mouth shape for the phoneme used to pronounce the word “no” (“n” in Faceposer) may be applicable if the word is pronounced in a strong, authoritative way, but would appear overdone and out of context if the same word were used to provide reassurance in a calming and less domineering manner. Indeed, if the developer wishes to create an uncanny sensation for a zombie character, adjusting mouth shapes so that articulation of speech appears over-exaggerated may enhance the fear factor for such characters by increasing perceived strangeness. In the same way that a snarling dog or ferocious beast may raise the corners of its mouth to show its teeth in an aggressive way, viewers may be made to feel uncomfortable by overstated mouth movements that suggest a possible threat.
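The phoneme-to-viseme lookup described above, together with the advice to avoid over-exaggerated articulation, can be illustrated with a minimal sketch. This is not Faceposer's implementation: the phoneme labels, viseme names, weight values, delivery styles, and the `max_weight` cap are all hypothetical, chosen only to show how a delivery style might scale the strength of each mouth shape.

```python
# Hypothetical phoneme-to-viseme table: (viseme name, base strength in [0, 1]).
# These labels and numbers are illustrative assumptions, not Faceposer data.
VISEMES = {
    "n": ("tongue_up", 0.6),
    "ow": ("round_open", 0.9),
    "m": ("lips_closed", 0.7),
}

# Hypothetical delivery styles that scale how strongly each shape is posed.
STYLE_SCALE = {"authoritative": 1.0, "reassuring": 0.6}


def viseme_track(phonemes, style, max_weight=0.8):
    """Return (viseme, weight) pairs, capped to limit over-exaggeration."""
    scale = STYLE_SCALE[style]
    track = []
    for p in phonemes:
        name, base = VISEMES[p]
        # Scale by delivery style, then clamp so no shape exceeds the cap.
        weight = min(base * scale, max_weight)
        track.append((name, round(weight, 2)))
    return track


# The word "no" posed for two different delivery styles.
strong = viseme_track(["n", "ow"], "authoritative")
calm = viseme_track(["n", "ow"], "reassuring")
print(strong)
print(calm)
```

Raising `max_weight` or the style scale would push the same line towards the overstated articulation that may suit an antipathetic character; lowering them softens it for an empathetic one.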
SUMMARY AND CONCLUSION
In summary, attributes of speech that may exaggerate the uncanny for realistic, human-like characters in computer games are:
1. A level of human-likeness for a character’s speech that does not match the fidelity of human-likeness for a character’s appearance.
2. An asynchrony of speech with lip movement.
3. Speech that is of an incorrect pitch or tone.
4. Speech delivery that is perceived as slow, monotone, or of the wrong tempo.
5. An over-exaggeration of articulation of the mouth during speech.
Whilst such characteristics of speech may heighten the spine-tingling sensation associated with the uncanny for antipathetic characters in the horror genre of games, a developer may risk the uncanny if such characteristics exist for empathetic characters. The protagonist Mary Smith, as featured in the tech demo for the adventure game Heavy Rain (2006), may have been intended to evoke affinity and sympathy from the audience. Instead, Mary Smith was regarded as strange and abnormal: Uncanny speech for this character contributed to just such a negative response from the audience. The speech was not only judged as lacking synchronization with lip movement but an inaccurate pitch and lack of human-likeness raised doubt as to whether the voice actually belonged to the character or not. Attributes such as these reduced the overall believability for Mary Smith. However, for zombies such as the Tank and the Witch from the survival horror game Left 4 Dead (Valve, 2008), uncanny speech increased (in a desired manner) how strange and freakish these characters were perceived to be. The outcomes from this investigation show that the majority of characteristics for uncanny speech in computer games may be induced by current technological limitations in the production, reproduction, and control of virtual characters. Restrictions on the range of facial muscles that can be manipulated in the automated facial animation tools used to generate footage in real-time are a current constraint on achieving realism in computer games comparable to film. It seems there is a lack of an exhaustive range of mouth shapes to fully represent each phoneme sound and variation of interpolation between syllables in a range of different contexts. Such constraints may contribute to a perceived asynchrony of speech and mouth
shapes being used for syllables that do not accurately convey the prosody or context of speech. Computer games may always be playing catch-up with the levels of anatomical fidelity achieved in film for facial animation; however, developments in procedural game audio and animation may provide a solution for uncanny speech. As Hug states, the future of sound in computer games is moving towards procedural sound techniques that allow for the generation of bespoke sounds, to create a more realistic interpretation of life within the 3D environment. For in-game play, dynamic sound generation techniques “such as physical modelling, modal synthesis, granulation and others, and meta forms like Interactive XMF” will create sounds in real-time responding to both user input and the timing, position, and condition of objects within gameplay (Hug, 2011). Using procedural audio (speech synthesis in this case), a given line of speech may be generated over a differing range of tempos using a delivery style appropriate for the given circumstance. For example, the sentence “I don’t think so” may be said in a slow, controlled manner, if carefully contemplating the answer to a question. In contrast, a fast-paced tone may be used if intended as a satirical plosive when at risk of being struck by an antagonist. Procedural animation techniques for the mouth area may also allow for a more accurate depiction of articulation of mouth movement during speech. Building on the existing body of research into real-time, data-driven, procedural generation techniques for motion and sound (for example, Cao et al., 2004; Farnell, 2011; Mullan, 2011), a tool might be developed that combines techniques for the procedural generation of emotive speech in response to player input (actions or psychophysiology) (Nacke & Grimshaw, 2011) or game state.
Interactive conversational agents in computer games or within a wider context of user interfaces may appear less uncanny if the tempo, pitch, and delivery style for their speech varies in response to the input from the person
interacting with the interface. Such a tool will aid in fine-tuning the qualities of speech that will, depending on the desired situation, reduce or enhance uncanny speech.
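As a purely illustrative sketch of how such a tool might vary delivery with game state, the following maps a single hypothetical `threat_level` value onto tempo and pitch settings for a line such as “I don’t think so”. The parameter names, the numeric ranges, and the linear interpolation are assumptions made for this example; they are not drawn from any existing speech synthesis system.

```python
from dataclasses import dataclass


@dataclass
class Delivery:
    tempo: float  # relative speaking rate, 1.0 = neutral (assumed scale)
    pitch: float  # relative pitch, 1.0 = neutral (assumed scale)


def choose_delivery(threat_level: float) -> Delivery:
    """Map a hypothetical threat level in [0, 1] to speech delivery.

    Interpolates linearly between a slow, controlled delivery (no threat)
    and a fast, raised-pitch delivery (imminent threat).
    """
    t = max(0.0, min(1.0, threat_level))  # clamp out-of-range input
    calm = Delivery(tempo=0.8, pitch=0.95)
    urgent = Delivery(tempo=1.4, pitch=1.15)
    return Delivery(
        tempo=calm.tempo + t * (urgent.tempo - calm.tempo),
        pitch=calm.pitch + t * (urgent.pitch - calm.pitch),
    )


# Contemplative answer versus a retort made under attack.
print(choose_delivery(0.0))
print(choose_delivery(1.0))
```

In practice the input could equally be a psychophysiological measure from the player rather than a game-state value; the point is only that delivery becomes a function of context rather than a fixed recording.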
REFERENCES
Alone in the dark [Computer game]. (2009). Eden Games (Developer). New York: Atari Interactive, Inc.
Anderson, J. D. (1996). The reality of illusion: An ecological approach to cognitive film theory. Carbondale, IL: Southern Illinois University Press.
Ashcraft, B. (2008). How gaming is surpassing the Uncanny Valley. Kotaku. Retrieved April 7, 2009, from http://kotaku.com/5070250/how-gaming-is-surpassing-uncanny-valley.
Atkinson, D. (2009). Lip sync (lip synchronization animation). Retrieved July 29, 2009, from http://minyos.its.rmit.edu.au/aim/a_notes/anim_lipsync.html.
Bailenson, J. N., Swinth, K. R., Hoyt, C. L., Persky, S., Dimov, A., & Blascovich, J. (2005). The independent and interactive effects of embodied-agent appearance and behavior on self-report, cognitive, and behavioral markers of copresence in immersive virtual environments. Presence (Cambridge, Mass.), 14(4), 379–393. doi:10.1162/105474605774785235
Ballas, J. A. (1994). Delivery of information through sound. In Kramer, G. (Ed.), Auditory display: Sonification, audification, and auditory interfaces (pp. 79–94). Reading, MA: Addison-Wesley.
Bartneck, C., Kanda, T., Ishiguro, H., & Hagita, N. (2009). My robotic doppelganger—A critical look at the Uncanny Valley theory. In Proceedings of the 18th IEEE International Symposium on Robot and Human Interactive Communication, RO-MAN2009, 269-276.
Brenton, H., Gillies, M., Ballin, D., & Chatting, D. J. (2005, September 5). The Uncanny Valley: Does it exist? Paper presented at the HCI 2005, Animated Characters Interaction Workshop, Napier University, Edinburgh, UK.
Browning, T. (Producer/Director). (1931). Dracula [Motion picture]. England: Universal Pictures.
Busso, C., & Narayanan, S. S. (2006). Interplay between linguistic and affective goals in facial expression during emotional utterances. In Proceedings of 7th International Seminar on Speech Production, 549-556.
Calleja, G. (2007). Revising immersion: A conceptual model for the analysis of digital game involvement. In Proceedings of Situated Play, DiGRA 2007 Conference, 83-90.
Cao, Y., Faloutsos, P., Kohler, E., & Pighin, F. (2004). Real-time speech motion synthesis from recorded motions. In R. Boulic & D. K. Pai (Eds.), Eurographics/ACM SIGGRAPH Symposium on Computer Animation (2004), 345-353.
Chion, M. (1994). Audio-vision: Sound on screen (Gorbman, C., Trans.). New York: Columbia University Press.
Edworthy, J., Loxley, S., & Dennis, I. (1991). Improving auditory warning design: Relationship between warning sound parameters and perceived urgency. Human Factors, 33(2), 205–231.
Ekman, I., & Kajastila, R. (2009, February 11-13). Localisation cues affect emotional judgements: Results from a user study on scary sound. Paper presented at the AES 35th International Conference, London, UK.
Emily Project. (2008). Santa Monica, CA: Image Metrics, Ltd.
Faceposer [Facial animation tool as part of Source SDK]. (2008). Bellevue, WA: Valve Corporation.
Farnell, A. (2011). Behaviour, structure and causality in procedural audio. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Ferber, D. (2003, September). The man who mistook his girlfriend for a robot. Popular Science. Retrieved April 7, 2009, from http://iiae.utdallas.edu/news/pop_science.html.
Freud, S. (1919). The Uncanny. In The standard edition of the complete psychological works of Sigmund Freud (Vol. 17, pp. 219–256). London: Hogarth Press.
Gaver, W. W. (1993). What in the world do we hear? An ecological approach to auditory perception. Ecological Psychology, 5(1), 1–29. doi:10.1207/s15326969eco0501_1
Gouskos, C. (2006). The depths of the Uncanny Valley. Gamespot. Retrieved April 7, 2009, from http://uk.gamespot.com/features/6153667/index.html.
Grant, W., Wassenhove, V., & Poeppel, D. (2004). Detection of auditory (cross-spectral) and auditory-visual (cross-modal) synchrony. Speech Communication, 44(1/4), 43–53. doi:10.1016/j.specom.2004.06.004
Green, R. D., MacDorman, K. F., Ho, C. C., & Vasudevan, S. K. (2008). Sensitivity to the proportions of faces that vary in human likeness. Computers in Human Behavior, 24(5), 2456–2474. doi:10.1016/j.chb.2008.02.019
Grey Matter [INDIE arcade game]. (2008). McMillen, E., Refenes, T., & Baranowsky, D. (Developers). San Francisco, CA: Kongregate.
Grimshaw, M. (2008a). The acoustic ecology of the first-person shooter: The player experience of sound in the first-person shooter computer game. Saarbrücken, Germany: VDM Verlag Dr. Mueller.
Grimshaw, M. (2008b). Sound and immersion in the first-person shooter. International Journal of Intelligent Games & Simulation, 5(1).
Grimshaw, M., Nacke, L., & Lindley, C. A. (2008, October 22-23). Sound and immersion in the first-person shooter: Mixed measurement of the player’s sonic experience. Paper presented at Audio Mostly 2008, Piteå, Sweden.
Half Life 2 [Computer game]. (2008). Valve Corporation (Developer). Redwood City, CA: EA Games.
Hanson, D. (2006). Exploring the aesthetic range for humanoid robots. In Proceedings of the ICCS/CogSci-2006 Long Symposium: Toward Social Mechanisms of Android Science, 16-20.
Hassanpour, A. (2009). Dubbing. The Museum of Broadcast Communications. Retrieved July 14, 2009, from http://www.museum.tv/archives/etv/D/htmlD/dubbing/dubbing.htm.
Ho, C. C., MacDorman, K., & Pramono, Z. A. D. (2008). Human emotion and the uncanny valley. A GLM, MDS, and ISOMAP analysis of robot video ratings. In Proceedings of the Third ACM/IEEE International Conference on Human-Robot Interaction, 169-176.
Hoeger, L., & Huber, W. (2007). Ghastly multiplication: Fatal Frame II and the videogame Uncanny. In Proceedings of Situated Play, DiGRA 2007 Conference, Tokyo, Japan, 152-156.
Hug, D. (2011). New wine in new skins: Sketching the future of game sound design. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
ITU-R BT.1359-1. (1998). Relative timing of sound and vision for broadcasting. Question ITU-R, 35(11).
Jentsch, E. (1906). On the psychology of the Uncanny. Psychiat.-neurol. Wschr., 8(195), 21921, 226-7.
Kanda, T., Hirano, T., Eaton, D., & Ishiguro, H. (2004). Interactive robots as social partners and peer tutors for children: A field trial. Human-Computer Interaction, 19(1), 61–84. doi:10.1207/s15327051hci1901&2_4
Kendall, N. (2009, September 12). Let us play: Games are the future for music. The Times: Playlist, p. 22.
Laurel, B. (1993). Computers as theatre. New York: Addison-Wesley.
Left 4 dead [Computer game]. (2008). Valve Corporation (Developer). Redwood City, CA: EA Games.
Lillian—A natural language library interface and library 2.0 mash-up. (2006). Birmingham, UK: Daden Limited.
MacDorman, K. F. (2006). Subjective ratings of robot video clips for human likeness, familiarity, and eeriness: An exploration of the Uncanny Valley. ICCS/CogSci-2006 Long Symposium: Toward Social Mechanisms of Android Science.
MacDorman, K. F., Green, R. D., Ho, C. C., & Koch, C. T. (2009). Too real for comfort? Uncanny responses to computer generated faces. Computers in Human Behavior, 25, 695–710. doi:10.1016/j.chb.2008.12.026
MacDorman, K. F., & Ishiguro, H. (2006). The uncanny advantage of using androids in cognitive and social science research. Interaction Studies: Social Behaviour and Communication in Biological and Artificial Systems, 7(3), 297–337. doi:10.1075/is.7.3.03mac
Matsui, D., Minato, T., MacDorman, K. F., & Ishiguro, H. (2005). Generating natural motion in an android by mapping human motion. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, 1089-1096.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264(5568), 746–748. doi:10.1038/264746a0
McMahan, A. (2003). Immersion, engagement, and presence: A new method for analyzing 3-D video games. In Wolf, M. J. P., & Perron, B. (Eds.), The video game theory reader (pp. 67–87). New York: Routledge.
Minato, T., Shimada, M., Ishiguro, H., & Itakura, S. (2004). Development of an android robot for studying human-robot interaction. In R. Orchard, C. Yang & M. Ali (Eds.), Innovations in applied artificial intelligence, 424-434.
Mori, M. (1970/2005). The Uncanny Valley (K. F. MacDorman & T. Minato, Trans.). Energy, 7(4), 33–35.
Mullan, E. (2011). Physical modelling for sound synthesis. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Nacke, L., & Grimshaw, M. (2011). Player-game interaction through affective sound. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Perron, B. (2004, September 14-16). Sign of a threat: The effects of warning systems in survival horror games. Paper presented at COSIGN 2004, University of Split, Croatia.
Plantec, P. (2007). Crossing the Great Uncanny Valley. In Animation World Network. Retrieved August 21, 2010, from http://www.awn.com/articles/production/crossing-great-uncanny-valley/page/1%2C1.
Plantec, P. (2008). Image Metrics attempts to leap the Uncanny Valley. In The Digital Eye. Retrieved April 6, 2009, from http://vfxworld.com/?atype=articles&id=3723&page=1.
Pollick, F. E. (in press). In search of the Uncanny Valley. In Grammer, K., & Juett, A. (Eds.), Analog communication: Evolution, brain mechanisms, dynamics, simulation. Cambridge, MA: MIT Press.
Reeves, B., & Voelker, D. (1993). Effects of audio-video asynchrony on viewer’s memory, evaluation of content and detection ability. (Research Report prepared for Pixel Instruments, CA). Palo Alto, CA: Stanford University, Department of Communication.
Reiter, U. (2011). Perceived quality in game audio. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Richards, J. (2008, August 18). Lifelike animation heralds new era for computer games. The Times Online. Retrieved April 7, 2009, from http://technology.timesonline.co.uk/tol/news/tech_and_web/article4557935.ece.
Ripken, J. (2009, October 19). Game synchronisation: A view from artist development. Paper presented at the Music and Creative Industries Conference 2009, Manchester, UK.
Roux-Girard, G. (2011). Listening to fear: A study of sound in horror computer games. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Schafer, R. M. (1994). The soundscape: Our sonic environment and the tuning of the world. Rochester, VT: Destiny Books.
Schneider, E., Wang, Y., & Yang, S. (2007). Exploring the Uncanny Valley with Japanese video game characters. In Proceedings of Situated Play, DiGRA 2007 Conference, 546-549.
Seyama, J., & Nagayama, R. S. (2007). The uncanny valley: The effect of realism on the impression of artificial human faces. Presence (Cambridge, Mass.), 16(4), 337–351. doi:10.1162/pres.16.4.337
Silent hill homecoming [Computer game]. (2008). Double Helix & Konami (Developer/Co-Developer). Tokyo, Japan: Konami.
Spadoni, R. (2000). Uncanny bodies. Berkeley: University of California Press.
Steckenfinger, A., & Ghazanfar, A. (2009). Monkey behavior falls into the uncanny valley. Proceedings of the National Academy of Sciences of the United States of America, 106(43), 18362–18366. doi:10.1073/pnas.0910063106
The Beatles: Rock band [Computer game]. (2009). Harmonix. Redwood City, CA: EA Games.
The casting [Technology demonstration]. (2006). Quantic Dream (Developer). Foster City, CA: Sony Computer Entertainment, Inc.
Tinwell, A. (2009). The uncanny as usability obstacle. In A. A. Ozok & P. Zaphiris (Eds.), Online Communities and Social Computing workshop, HCI International 2009, 12, 622-631.
Tinwell, A., & Grimshaw, M. (2009). Bridging the uncanny: An impossible traverse? In Proceedings of Mindtrek 2009.
Tinwell, A., Grimshaw, M., & Williams, A. (2010). Uncanny behaviour in survival horror games. Journal of Gaming and Virtual Worlds, 2(1), 3–25. doi:10.1386/jgvw.2.1.3_1
Toprac, P., & Abdel-Meguid, A. (2011). Causing fear, suspense, and anxiety using sound design in computer games. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Vatakis, A., & Spence, C. (2006). Audiovisual synchrony perception for speech and music using a temporal order judgment task. Neuroscience Letters, 393, 40–44. doi:10.1016/j.neulet.2005.09.032
Vinayagamoorthy, V., Steed, A., & Slater, M. (2005). Building characters: Lessons drawn from virtual environments. In Proceedings of Toward social mechanisms of android science, COGSCI 200, 119-126.
Warren, D. H., Welch, R. B., & McCarthy, T. J. (1982). The role of visual-auditory “compellingness” in the ventriloquism effect: Implications for transitivity among the spatial senses. Perception & Psychophysics, 30(6), 557–564.
Warrior Demo. (2008). Santa Monica, CA: Image Metrics, Ltd.
Weschler, L. (2002). Why is this man smiling? Wired. Retrieved April 7, 2009, from http://www.wired.com/wired/archive/10.06/face.html.
Zemeckis, R. (Producer/Director). (2004). The polar express [Motion picture]. California: Castle Rock Entertainment.
Zemeckis, R. (Producer/Director). (2007). Beowulf [Motion picture]. California: ImageMovers.
KEY TERMS AND DEFINITIONS
Audio-Visual: An artifact with the components image and sound.
Cross-Modal: Interaction between sensory and perceptual modes, in this case, of vision and hearing.
Realism: Representation of objects as they may appear in the real world.
Uncanny Valley: A theory that as human-likeness increases, an object will be regarded as less familiar and more strange, evoking a negative effect for the viewer (Mori, 1970).
Virtual Character: A digital representation of a figure onscreen.
Viseme: A visual representation of a mouth shape for a particular speech utterance such as “k,” “ch” and “sh.” Those with hearing impediments can use visemes to lip read and understand the spoken language when unable to hear sound.
ENDNOTE
1 In the field of psychoacoustics, synchrony and synchresis are closely related to the ventriloquism effect.
Chapter 12
Emotion, Content, and Context in Sound and Music
Stuart Cunningham, Glyndŵr University, UK
Vic Grout, Glyndŵr University, UK
Richard Picking, Glyndŵr University, UK
ABSTRACT
Computer game sound is particularly dependent upon the use of both sound artefacts and music. Sound and music are media rich in information. Audio and music processing can be approached from a range of perspectives which may or may not consider the meaning and purpose of this information. Computer music and digital audio are being advanced through investigations into emotion, content analysis, and context, and this chapter attempts to highlight the value of considering the information content present in sound, the context of the user being exposed to the sound, and the emotional reactions and interactions that are possible between the user and game sound. We demonstrate that by analysing the information present within media and considering the applications and purpose of a particular type of information, developers can improve user experiences and reduce overheads while creating more suitable, efficient applications. Some illustrated examples of our research projects that employ these theories are provided. Although the examples of research and development applications are not always examples from computer game sound, they can be related back to computer games. We aim to stimulate the reader’s imagination and thought in these areas, rather than attempt to drive the reader down one particular path.
INTRODUCTION
Music and sound stimulate one of the five human senses: hearing. Any form of stimulation is subject to psychological interpretation by the individual
and a cause-and-effect relationship occurs. Whilst this relationship is unique to each individual up to a point, it is safe to assume that broad, often shared, experiences occur across multiple listeners. It can be argued that the emotional reaction and response of a listener to a sound or piece of
DOI: 10.4018/978-1-61692-828-5.ch012
Copyright © 2011, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
Figure 1. Idealised Role of Emotion, Content and Context in a Computer Application
music is the single most important event resulting from that experience. The goal of this chapter is to explore the relationship between sound stimuli and human emotion. In particular, this chapter examines the role sound plays in conveying emotional information, even from sources that may be visual in origin. Equally, the chapter seeks to demonstrate how human emotion is able to flip this paradigm and influence music and sound selection, based on emotional state and consideration of the context of the user. The content being represented digitally provides the opportunity to gain a greater understanding of the information present in a data set. Information being stored often has a number of characteristic features and structural elements that can be identified automatically. For example, music generally contains an identifiable structure, which might consist of several movements, parts, or, more commonly, verses and choruses. However, such structure can almost be considered fractal, in that there are microscopic and macroscopic levels of organisation and also repetition, ranging from musical beats, bars, verses and choruses to the level of the song itself.
Contextual data provides additional information about factors that contribute to making the user interaction experience much more relevant and effective by acquiring knowledge of the external factors that influence decision making and the emotion of the user. The conceptual diagram of Figure 1 shows an idealised situation in which a large database of audio media is presented to the user through a suitable application (such as a computer game). In this scenario, the user’s emotion and context are analysed and compared against analysis of appropriate media content. This provides selection of the ‘best fit’ media that will further stimulate and engage the user in the most effective way. The chapter explains the fundamentals of emotional stimulation using sounds and music, whilst retaining relevance to the audiologist. We demonstrate that by analysing the information present within media and considering its applications, significant advantages can be gained which improve user experiences, reduce overheads, and aid in the development of more suitable, efficient applications: whether they be computer games or other audio tools.
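The ‘best fit’ selection idealised in Figure 1 can be sketched in a few lines, under strong simplifying assumptions. Here each media item is tagged with a two-dimensional emotion descriptor (valence and arousal, each in the range -1 to 1), the user's context is reduced to an offset applied to their measured emotion, and ‘best fit’ is taken to mean the nearest item by Euclidean distance. The library contents, the descriptor scheme, and the function names are all hypothetical and serve only to illustrate the comparison step.

```python
import math

# Hypothetical media library: name -> (valence, arousal), each in [-1, 1].
LIBRARY = {
    "calm_ambient": (0.4, -0.6),
    "tense_strings": (-0.6, 0.7),
    "victory_theme": (0.8, 0.6),
}


def target_emotion(user_emotion, context_bias):
    """Shift the user's measured emotion by a context offset, clamped to [-1, 1]."""
    return tuple(
        max(-1.0, min(1.0, e + c)) for e, c in zip(user_emotion, context_bias)
    )


def best_fit(target, library=LIBRARY):
    """Return the name of the media item whose tag lies closest to the target."""
    return min(library, key=lambda name: math.dist(library[name], target))


# A mildly anxious player in a context that pushes towards higher tension.
target = target_emotion((-0.2, 0.3), (-0.3, 0.3))
print(best_fit(target))
```

A real system would need far richer content analysis than a two-number tag, but the structure is the same: analyse the user, analyse the media, and select by similarity.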
EMOTION
Emotion is a key factor to consider in computer applications given that almost all applications will have some form of Human Computer Interface (HCI). Humans are emotional beings and the interaction with the machine will have some emotional effect on them to a greater or lesser extent. The computer, therefore, has an ability to invoke an emotional response in the user. The user may bring to an interactive experience their own emotions, which have been affected by external factors in the environment around them (Dix, Finlay, Abowd, & Beale, 2003). The quality and resultant experience that a user has with a machine is important and this is also true when we consider the frequent interaction that we have with entertainment media and computer games.
Emotion in Multimedia
The use of sound in multimedia, and especially in computer games, is commonplace. This is unsurprising if one considers that, in order to successfully engage a human user in an immersive experience, the interaction must be achieved through one of the primary human senses. Speech and hearing are hugely important in our daily lives and allow us as humans to send and receive large amounts of information on an ad-hoc basis. Naturally, it is hearing and the use of sound that we are interested in examining in this chapter. Sound is used in complementing and augmenting other stimuli, especially visual. Consider, for example, the last time you watched a horror movie and were embarrassed by the unintended jump or flinch you experienced at a big bang or crescendo that accompanied the appearance of the bad guy in the movie! Proof, if it were needed, that the constructive use of music and sound can provoke one of the most primal of human emotional instincts: fear. Sound in multimedia environments is classified into two distinct categories (see Jørgensen, 2011 for a fuller analysis of these terms):
• Diegetic. Sound or music that is directly related, or at least perceived to be related, to the environment in which the subject is intended to be immersed. For example, in a movie this could be the sound coming from a television that is in the room pictured on screen. Another example would be the voices of the characters on screen or the sound of a character firing a gun or driving a car. In a nutshell, the subject is able to reasonably identify the source of the sound given the surrounding virtual environment.
• Non-diegetic. These sounds are generally presented to augment or complement the virtual environment but come from sources that the subject cannot identify in the current environment. To go back to the horror movie example again, consider the famous shower scene from Alfred Hitchcock’s classic Psycho from 1960: the screeching, stabbing violin sounds as the character of Marion Crane is stabbed by Norman Bates (dressed as his mother). There is no reason for the watcher of Psycho to believe that there is a collection of violinists in the bathroom with Norman and Marion; rather, the music is there to enhance the environment that is presented.
Emotion in Computer Games
Game players exhibit larger emotional investment in games than in many other forms of digital entertainment, primarily due to the interactive nature of the medium. Jansz (2006) argues that game players often emotionally immerse themselves in games to experience emotional reactions that cannot reasonably be stimulated in the real world: a sandbox environment for emotional development and experience. This notion will probably be familiar to most readers, as many of us will have deliberately watched a scary movie to try and frighten ourselves and because we enjoy experiencing the sensations and physical responses
of being frightened, provided we are within a controlled environment. Freeman (2004) provides a list of reasons that support the activation of emotion in computer games, citing “art and money” (p. 1) as the principal drivers, although his work focuses mainly on the latter, such as competitive advantages for games development companies, rather than direct benefit to consumers and game players. Nevertheless, as Freeman advocates, this awareness in the industry of the need to integrate emotion further into computer gaming is evidence of market demand and big business interest in this exciting field. Emotion manifests itself in many ways and there is an identifiable physical symptom in the user. Whilst the studies discussed later concentrate on identifying physical emotional reaction, these have not always been directly linked to the player’s physical interaction with the game. However, research by Sykes and Brown (2003) describes an initial study that deals with investigating not just emotional response or reaction in users but emotional interaction with a game. Sykes and Brown also support the theory that emotional reaction and interaction represent significant potential in being able to adapt and manipulate gaming environments in response to the emotional and affective states of the user. Their investigation dealt with determining if the amount of pressure applied to the buttons of a computer game controller pad correlated with an increased level of difficulty in the game environment. A benefit of using this approach as opposed to galvanic skin response or heart rate monitoring is that those mechanisms can be altered by the environmental changes around the user whereas changes in pressure applied to the game controller are much more likely to have been caused by events occurring in the game. Their results indicated that players did indeed apply greater pressure to the game controller when a greater level of difficulty and concentration was required in the game.
Although the study is preliminary and relatively small-scale, the authors’ methods
of analysis employ significance testing of the data collected. Ravaja et al. (2005) conducted experiments that attempt to evidence the impact of computer gameplay upon human emotions by employing an array of biometric measurements. This is based upon the generally held theory that emotion is expressed by humans in three forms: “subjective experience (e.g., feeling joyous), expressive behavior (e.g., smiling), and the physiological component (e.g., sympathetic arousal)” (p. 2). Taking this further, the authors make the point that the psychological connection between a player and a computer game exceeds pure emotion and touches cognition where players make assertions and links to the game: believing they are a super-hero or ninja warrior, for example. This work also highlights the issue that, until recently, research into emotional enjoyment and influence has focused upon non-interactive, mass media communication channels, such as television, film and radio. The wide range of measurements used by Ravaja et al. is concise and, as the authors indicate, few other studies have employed such a wide range of metrics when investigating emotional connection with computer gameplay. The authors use electrocardiogram (ECG)/inter-beat intervals (IBI), facial electromyography (EMG) and skin conductance level (SCL) as measurements during their experiments. The experiments showed that reliable results are achieved across a range of subjects in response to significant events in a game scenario (such as success, failure, poor performance and so on). This work provides very strong evidence that subjects exhibit strong, identifiable physical reactions that are typical during emotional arousal when playing with computer games. It supports the argument, made in this chapter, that emotion, through physical disturbance, is a strong method for detecting emotional state and response when interacting with computer games. 
Broadly speaking, positive and negative game events correlated to positive and negative emotional reactions in players. However, one
point of note from the study is that the intuitively expected emotional response was not always the one that was encountered in subjects. One criticism of Ravaja et al.’s study is that, although a reasonable sample size was used (36 participants), the gender balance was almost 70% in favour of male participants. Whilst it can be argued that the gaming population is likely to be male in majority, the study could have reflected the situation more accurately. The paper does not attempt to account for this disparity or investigate whether a significant difference was present between the results of the male participants and female participants (see Nacke & Grimshaw, 2011 for indications of gender difference in response to game sound). Although beyond the scope of their paper, the work could have been much strengthened by performing some form of subjective response with subjects on their performance in the game scenarios, thus allowing a more valid conclusion by employing triangulation of quantitative and qualitative methods. This would complement the reliable results attained through their objective measurements. There is no doubt that emotion plays a significant affective role in computer gaming and that it has the potential to be used both as a reactive and interactive device to stimulate users. The emotion elicited in gamers is a function of both the content of the game as well as the context in which the user is placed, further justifying the aims and underlying concept of this chapter: that these three traits are inextricably linked and that further understanding and utilising them must therefore lead to more intense, immersive, and interactive gaming. Conati (2002), for example, considers how probabilistic models can be employed to develop artificial intelligence systems that are able to predict emotional reactions to an array of content and contextual stimuli in education games, with the aim of keeping the player engaged with the game. 
But what of sound linked to emotion in games?
The Use of Emotional Sound in Games
Research by Ekman (2008) bridges the gap between traditional movies and modern computer games by explaining how sound is used to stimulate emotions in each of these media. Ekman enhances her discussions with summaries of some of the numerous theories in the portrayal of emotional involvement experienced through sound and music. Perhaps most importantly in her work, Ekman emphasises the difference between the role of sound in movies as opposed to computer games. Principally, this is that sound in movies is present to enhance the narrative and heighten the experience whereas, in computer games, sound must perform not only this function but also serve as a tool for interaction, often to the extent where the narrative element is sacrificed in favour of providing informational content. Ekman’s work therefore suggests that incorporating diegetic and non-diegetic sounds into computer games significantly increases the level of complexity for the sound designer.
Kromand (2009) feels strongly that sound can be used to influence a game player’s stress and awareness levels by incorporating suitable mixtures of diegetic and non-diegetic sound. He provides examples of several contemporary computer games that feature such affective sound. In particular, his work focuses on the popular BioShock, F.E.A.R. and Silent Hill 2 titles. Kromand’s work is an interesting starting point and introduction to the use of sound in games, especially in inducing more unpleasant sensations. He provides extensive discussion and illustrative examples and considers the concept of trans-diegetic sounds (Jørgensen, 2011): those which transcend the traditional barrier between diegetic and non-diegetic. Kromand concludes by proposing that mixtures of diegetic and non-diegetic sound can lead to confusion and uncertainty about the environment and actions around the game player. He hypothesises that this confusion is purposefully implemented in the game
environment and that the uncertainty of events taking place adds to the emotional investiture of the player in the game. Though not as up-to-date as other works concerning computer games and human emotion, a corresponding work, which also looks at methods of eliciting emotional state in computer gamers, comes from Johnstone (1996). The age of this paper alone demonstrates the importance and significance of the emotional link between computer games and game players. His study concerns the discernment of emotional arousal by speech sounds made by users during their interaction with a computer game. Part of the rationale behind his approach is hypothesised to be because the feedback equipment of today (heart rate monitors and skin conductance devices) was not so readily or cheaply available in 1996. An interesting concept that is partially addressed by Johnstone is that spontaneous emotional speech sounds differ acoustically from those that are planned and considered. If this theory holds true, then it means that genuine emotional responses can be distinguished from planned responses. In effect, this is somewhat analogous to the use of voice stress analysis in lie detection scenarios. Johnstone indicates that this ability is also useful in a truly interactive manner, since it not only means that users or game player responses can be analysed to determine emotional valences, but also that synthesised speech, such as the voices of characters in games, could be manipulated in similar acoustic ways to provide more realistic and affective game environments and conjunctions. For diegetic sounds in particular, this presents a world of opportunity. The results of Johnstone’s initial study are promising though there are some methodological aspects of the research that would have benefited from tighter control. For example, subjects’ spontaneous speech sounds were recorded and analysed but they were also required to answer subjective questions to provide speech samples. 
By the very nature of such an enquiry, the subjects would
have been required to consider their response during which time the effects of spontaneity or the moment could well have been depleted. The results gained are not enough to fully support the idea of distinguishing spontaneous sounds from planned although there is evidence to suggest that this might be a logical progression in future. Nevertheless, the data collected shows promise in being able to determine notions of urgency and felt difficulty in the game environment from events that are associated with achieving the objectives of the game. Primarily this can be measured by changes in spectral energy levels, low frequency energy distribution, and shorter speech duration. In more recent work, Livingstone and Brown (2005) present theories and results that support the use of auditory stimuli to provide dynamic and interactive gaming environments. Whilst their paper explores the use of musical changes and emotional reactions in a general sense, part of their work is also devoted to investigating the application in gaming. Their underlying concept is that musical changes in the game can trigger emotional reactions in game players in a more dynamic manner than is currently the norm. Livingstone and Brown employ a rule-based analysis of symbolic musical content that relates to a fixed set of emotional responses. Their work demonstrates that by dynamically altering the musical characteristics of playing music, such as the tempo, mode, loudness, pitch, harmony and so forth, the user perceives different emotional intentions and contexts within the piece of music that is currently playing. Music, then, stimulating one of the five human senses, is capable of influencing emotional change within humans in a computer game environment. The work of Parker and Heerema (2008) presents a useful overview of how sound is used in diegetic and non-diegetic forms within computer games. 
They argue that greater use should be made of sound in order to enhance the game environment and experience. A primary example they give is that sound should also serve as a tool for input
and interaction with the game, rather than being present purely to be heard. They reiterate that sound in games at present is reactive rather than interactive. However, in this chapter, we suggest that sound is simply a tool of the emotions and that it is player emotion that should be interactive, rather than reactive, in order to provide a new level of computer gaming experiences. We feel this can be strongly underpinned by the use of sound. Parker and Heerema go on to describe audio gaming and provide a series of examples and discussions of scenarios where sound can be used as the primary interaction mechanism between the player and the computer game. These range from the player reacting to audio cues, providing the game with input using speech, or other sonic input, and by directly controlling sound and music in the game. Although concise and valid at representing the current state of play of sound in games, their work does not consider the affective nature of using sound in games. Emotion is triggered by sound and the two are intrinsically linked. Recent work by Grimshaw, Lindley, and Nacke (2008) seeks to formalise the relationship between a subject’s immersion in a game environment as a function of the auditory content. Grimshaw et al. employ a series of biometric techniques to provide insight into the human emotional and physiological response to the sonic actions and environment of a first-person shooter game. Their method employs a significant array of quantitative, physiological measurements that are correlated with subjective questioning. The deep complexity of human emotion and psychology is exposed in their work as a strong relationship between the results of these two investigative methods cannot be found. This deficiency is the subject of significant discussion by the authors and, unsurprisingly, it is suggested as an area for significant future investigation. 
It is important to place an emphasis on this point: although broad hypotheses and empirical evidence show sound and music play a large part in stimulating emotional responses in human subjects, the quantification of these effects, especially
objective measurement, is elusive. Subjective investigation has traditionally been the forte of psychological and sociological researchers. It is for this reason that sound designers and scientists working in the field must have an awareness of these issues, especially the sound designer working in computer game and multimedia development. In short, emotion is highly difficult to measure in an absolute way. Bridging this gap must be done carefully and backed up by considered research and investigation. There is a wealth of literature relating to the emotional impact of games. Although there is an increasing amount of published work concerning audio games, the majority of literature still concerns itself with traditional, visually-focused games. As the reader may have noticed in this section of this book, few studies have concerned themselves with using sound as the primary interactive method whilst also monitoring and responding to the emotional reactions of the game player. It is just this sort of scenario that the studies and ideas presented in this chapter aim to inspire, support and help stimulate.
Are Sound and Music Really Important in Games?
It is interesting to consider to what extent sound is perceived as being important by users in computer games. If we consider the move from the beeps and clicks of early computer games such as Space Invaders and Pong to modern alternatives such as the Guitar Hero series, we can see that the computer games industry has certainly placed an increased focus on the use of sound and music in games. To this end, we conducted research, by means of a user survey, into user awareness of sound in computer games. The work is documented in greater detail in Cunningham, Grout, and Hebblewhite (2006), but a summary of the important findings and discussions is provided here.
Table 1. Overall Game Genre Preference of Survey Participants in Rank Order
Game Genre                  Preference (%)
Role-Playing Game (RPG)     39
Shoot-em-up                 24
Strategy/Puzzles            12
Adventure                   9
Sports                      9
Simulation                  3
Other                       3
This survey was undertaken to establish the various factors that subjects considered important when it came to purchasing a new computer game. Our initial hypothesis was that users would rate factors such as the playability and visuals of a game much higher than the sound and music, demonstrating that the focus in the computer game world tends to be in the graphical and gameplay domains. The survey had a total of 34 respondents. A profile of the gamers participating in the survey, in terms of their game type preference, is shown in Table 1. We believe a future study should investigate whether the favoured game genre affects the particular factors that users look for in games. For example, role-playing games have traditionally been much more limited in terms of their graphic and aural flamboyance, with greater emphasis being placed upon story-line and depth, whilst action and adventure games are often much more visually stimulating. Figure 2 and Figure 3 illustrate the results of questions where participants were asked to indicate the most important feature that influenced game purchasing decisions and, since it was assumed prior to the study that playability or gameplay would most likely be rated highly, the second most important feature. Not surprisingly, we found that the most important factor is playability. The ratings for all the other possible factors are negligible, although it is somewhat surprising that no participants rated
the sound or musical elements to be important when deciding upon a game to buy. Intriguingly, the ability to play a game online with other users took favour over sound, which offers an insight into the mind of the 21st Century games player. Users who selected the “Other” category were prompted to provide a more detailed explanation. The responses received here were “depth and creativity” and “the whole package”, and two participants stated that the “story or scenario” was most important.
Figure 2. Rating the Most Important Game Feature
It is argued, on the basis that playability will always rate highest, that the results in Figure 3 are more insightful than those in Figure 2. After all, the whole notion of computer games is that they are to be played with! This time we see, as we might well have expected, that graphics and visual stimulation were the most popular factor. As expected, the sound present in a game was cited by a low percentage of those surveyed. The users who chose the “Other” category on this occasion also stated that the factor important to them related to the story of the game. Encouragingly, however, and still applicable in the context of sound in games, is the percentage of users that value the interface. If we consider some of the most recent successful games, where the use of music and sound has been prominent, these titles almost all employ an interactive sound interface of some form. Prime examples include the Guitar Hero, Rock Band,
Dance Dance Revolution (DDR), and SingStar series of games, as well as Battle of the Bands, Ultimate Band and Wii Music, to name just a few. For the budding entrepreneur game developer, it is probably worth taking note that the majority of these titles revolve around the player being placed in a live music performance scenario or band. We briefly attempted to analyse these two assumptions through our survey, though the results are inconclusive with an almost 50/50 split between positive and negative responses. However, it is worth bearing in mind that these responses are now somewhat dated. An overview of the responses is presented in Table 2.
Figure 3. Rating the Second Most Important Game Feature
Table 2. Survey of Participants’ Interest in Game Soundtrack and Music
Does the soundtrack/music of a computer game make you more interested in playing or buying it?
Response      Percentage
Yes           50%
No            44%
Don’t Know    6%
It is reasonable to suggest that the soundtrack of a game brings an added attraction when it comes to a gamer parting with their hard-earned cash. As mentioned earlier, game series like Grand Theft Auto, FIFA and Dave Mirra feature music by well-known recording artists, in some cases including music that has been commissioned specifically for that game. It can be seen in the results summarised in this section that, other than the added value suggested above, users do not place any particular emphasis on game sound. As was expected, the main aspects users were interested in were the playability and graphics of a game, although interaction with sound offers great potential. The development of new sound-motivated games will be a dynamic and challenging field in the years to come, though we must not forget the golden rule of a successful game: playability. The under-use of sound in games is further supported by Parker and Heerema (2008), a source the reader is encouraged to investigate if they are still in doubt as to the true potential of sound in the gaming environment. To quote directly from their work: “The use of sound in an interactive media environment has not been advanced, as a technology, as far as graphics or artificial intelligence” (p. 1). Their work goes on to justify these assertions and they explain that poor quality sound in a game often results in the game being unsuccessful in the marketplace, whilst the success of a game containing an acceptable or higher quality of sound will be based upon other factors such as playability or graphics.
It is clear from the discussions and investigation covered in this section that human interaction and psychological and emotional links with games are more and more to the fore, as well as becoming increasingly important in the development of successful gaming experiences. It is fair to assume that users are not only affected by sound and music but that they also respond to feedback and interact with the game, essentially providing full-duplex communication between human and machine that is becoming increasingly information-rich. It is these interactions and information that the rest of this chapter focuses on.
CONTENT
Digital audio data holds much more information than the raw binary data from which it is constituted. At its barest, sound and music are generally provided to augment and provide realism to the current scenario. However, as we demonstrated in the previous section of this chapter, computer games are truly multimedia experiences that combine a range of stimuli to interact with the user. In short, we see the area of content analysis as providing a semi-intelligent mechanism with which to tie together one or more media employed in a multimedia environment in order to provide even more effective and efficient interaction and experience. Content of a particular medium can take many different forms, some of which will be shared across a range of media while others will be exclusive to a particular media type. The following is an attempt to briefly describe and exemplify these two categories:
• Shared content information. If we consider an entire multimedia artefact as being a hierarchical object, greater than the sum of its parts, then shared content information would be found attached to each of the media components present. For example, the publishing house, year of production, copyright information, and name of the game in which a multimedia object (a sound or otherwise) appears will always be the same. This information is generally that which is exclusively available in the form of meta-data and requires little data mining to extract.
• Exclusive content information. This is information about the content that can only be found in a given type of media. Although the same content information may appear in multiple instances of that media, it will generally be exclusive to that type of media. For example, if we consider the music present in a computer game, the exclusive content information would include the tempo, amplitude range, time signature, spectral representation, self-similarity measurement, and so on.
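The distinction between the two categories above can be made concrete with a small data structure. The following Python sketch is purely illustrative: the class names, field names, and example values are assumptions introduced here and do not come from any particular game or toolkit. It shows shared content information held once and attached to each media component, with exclusive content information stored per media type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedContent:
    """Meta-data common to every media component of one game (hypothetical fields)."""
    game_title: str
    publisher: str
    year: int
    copyright_notice: str


@dataclass
class MusicTrack:
    """A music asset: shared meta-data plus information exclusive to audio."""
    shared: SharedContent
    tempo_bpm: float
    time_signature: str
    amplitude_range_db: float


@dataclass
class Texture:
    """An image asset: the same shared meta-data, different exclusive fields."""
    shared: SharedContent
    width_px: int
    height_px: int


# Placeholder values, invented for illustration only.
shared = SharedContent("Example Game", "Example Publisher", 2011, "(c) Example Publisher")
track = MusicTrack(shared, tempo_bpm=120.0, time_signature="4/4", amplitude_range_db=48.0)
texture = Texture(shared, width_px=512, height_px=512)
```

Both assets point at the same `SharedContent` object, whereas tempo and time signature exist only on the music asset and would have to be obtained by analysing the audio itself.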
The relationship between sound and visual elements has been a mainstay of the media field since its inception. Consider the music video and Hollywood movie. In both, the sound and visual content presented to the user are carefully correlated. Prime examples include the synchronisation between actions and transitions appearing in the visual field and the sound content. An illustration of this that the authors find particularly effective is the opening sequence of the 1977 movie Saturday Night Fever. This scene treats the viewer to shots of John Travolta’s feet, pounding the streets of New York in time to the Bee Gees’ classic Stayin’ Alive: a classic in its own right and an almost ridiculously simple example of the sound content being combined with the visual content to produce something that has a much greater impact than either of the two individual components.
As Zhang and Jay Kuo (2001) demonstrated, it is quite possible to extract and classify a range of different sound content types from multimedia data, especially the kind of mixes found in traditional entertainment like television shows and movies. Though their work is focused on the traditional media of multimedia communication, the computer game environment is simply a natural extension of this, with the major difference being the integration of an element of interactivity. It is these principles that we hope content analysis allows us to build upon and utilise in the field of electronic media processing and development. In particular, we hope that game sound content can be analysed to provide an enhanced gaming experience. As a good starting point for consideration, we began to explore the relationship between visual information and music in electronic media, to provide an augmented experience when viewing the visual data. In another of our works (Davies, Cunningham, & Grout, 2007) we attempted to generate musical sequences based upon analysis of digital images: in that particular case, those of photographs and traditional works of art. The underlying thoughts and questions that motivated that research revolved around suggestions such as: What would the Mona Lisa sound like? We felt this would also provide additional information for people who were, for example, visually impaired, and it could be used to provide added description and emotional information relating to a particular still image. It became a logical ethos that the only way in which this could be achieved would be to analyse the content of the image, as it is this that contains the information and components required to relay the same information but in an alternative format. A tool that we have found very effective in analysing musical content is that of the Audio Similarity Matrix (ASM), based upon ideas initially proposed and demonstrated by Foote (1999). 
This allows a visual indication of the self-similarity, and therefore structure, of a musical piece. We suggest further reading into Foote’s work as a good starting point to stimulate the imagination into how content analysis can provide highly useful information for a variety of scenarios. A graphical example of an ASM is presented in Figure 4 as an exemplar, where dark colours represent high similarity and bright colours show low similarity.
Figure 4. ASM of ABBA’s “Head Over Heels” (28 second sample)
Whilst we do not limit the application of content analysis to computer games, we suggest a few examples of appropriate situations where it may be used. Simple examples relate to the link between visuals and sound. In a game where the scene is bright and full of strong, primary colours, it would be pertinent to include sound that reflects this notion: bright and strong in timbre. On the other hand, a dark, oppressive scene would require slower, darker music with a thinner and sharper timbre, inducing a different set of emotions. In today’s dynamic computer games, where the user has an apparently boundless freedom to explore a virtual environment, the dynamic updating of sound content to match the visuals requires some
form of content analysis. Even a simple parameter that defined the “colour” of the scene or presence of tagged objects nearby would suffice on a basic level. Another suggestion would be to manipulate the gameplay by the choice of music and sound prescribed by the user. For example, the same game scenarios and task may be undertaken at different speeds, levels of difficulty, and in different environments, based upon the choice of music the user makes. Consider the scenario where a user may decide to play dance music whilst interacting with a game, thereby instigating a bright, quantised environment with predictable, rhythmic, structured gameplay content. Whereas if they choose highly random, noisy, alternative noise-core they would be presented with chaotic, overwhelming game scenarios: in both cases, a reflection of the structure and content of the music that can be achieved only through detailed signal and structural content analysis of the audio data.
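As an illustration of the kind of structural analysis mentioned in this section, the self-similarity idea behind the ASM can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the method used to produce Figure 4: the choice of a plain magnitude spectrum as the per-frame feature, the frame size, and all function names are assumptions made here for illustration. Each frame of the signal is compared with every other frame using cosine similarity, so that repeated sections of a piece appear as blocks of high similarity.

```python
import cmath
import math
from typing import List, Sequence


def frame_spectrum(frame: Sequence[float]) -> List[float]:
    """Magnitude spectrum of one frame via a naive DFT (adequate for short frames)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2)
    ]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def similarity_matrix(signal: Sequence[float], frame_size: int = 64) -> List[List[float]]:
    """Compare every non-overlapping frame of the signal with every other frame."""
    frames = [
        signal[i:i + frame_size]
        for i in range(0, len(signal) - frame_size + 1, frame_size)
    ]
    features = [frame_spectrum(f) for f in frames]
    return [[cosine_similarity(a, b) for b in features] for a in features]


def tone(cycles_per_frame: int, length: int, frame_size: int = 64) -> List[float]:
    """A sine tone completing a whole number of cycles in each frame."""
    return [math.sin(2 * math.pi * cycles_per_frame * t / frame_size) for t in range(length)]


# A toy signal with an "A B A" structure: two frames of each section.
signal = tone(4, 128) + tone(12, 128) + tone(4, 128)
asm = similarity_matrix(signal)
```

In this toy case the frames of the first and last sections are highly similar to one another and dissimilar to the middle section, which is the kind of structure an ASM makes visible. How the values are mapped to colours, as in Figure 4, is a separate display choice.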
CONTEXT
Context awareness also provides opportunities for a heightened user experience with digital media systems, particularly those that hold large data sets, the content of which may only be relevant to a user in certain usage scenarios. We believe the incorporation of contextual information into digital devices provides a more tailored experience for users. Contextual information can be considered as an added extra in digital media systems, allowing more defined information about the user to be brought into software systems. Recommendation systems are a good example of where contextual data can be included. Schmidt and Winterhalter’s (2004) work in e-learning is a good example of how context awareness can be incorporated into digital, computer-based communication media. In their field, the context of the user is particularly important as it allows greater control and focusing of learning and teaching materials in order to engage at a deeper level with the user. Their work emphasises that the key stages of context awareness are in first acquiring contextual information and then building a suitable user-context model so as to estimate the current context of the user. Schmidt and Winterhalter also reinforce notions that good contextual modelling comes by acquiring information from a range of sources. Most importantly, in discussions of the importance of user context, Schmidt and Winterhalter hit upon the key questions that context awareness is able to begin to address: “How do we know what the user currently does, or what he intends to do?” (p. 42). Schmidt and Winterhalter choose to employ more passive mechanisms for contextual data acquisition, such as those which passively track user progress through tasks and record commonly accessed information. This is perfectly suitable for e-learning applications but, in the field of computer games and interactive entertainment, we feel that something more active is required.
A crucial work that backs up these notions of more interactive and reliable context awareness, especially when it comes to the surrounding environment, can be found in Clarkson, Mase, and Pentland (2000). Although this work may now be slightly dated, the principles and techniques employed in their work are effective and provide good examples of the type of contextual information that can be acquired by using simple sensor input. Their work investigates how context, such as whether the user is on a train or at work and whether they are in conversation or not, can be estimated from sensor input, primarily a camera and microphone. Such work provides a strong basis from which to lead into more specific analysis of context that is relevant to the current activity or software application. This is further elaborated upon in the context of mobile device usage by Tamminen, Oulasvirta, Toiskallio, and Kankainen (2004), who consider determining contextual information in mobile computing scenarios. Computers, gaming consoles and mobile devices have all become much more powerful in recent years and interface with a range of local and remote information sources. These information sources range from the traditional tactile input devices to accelerometers, touch screens, cameras, microphones and so on. The Nintendo Wii and Apple iPhone and iPod are prime examples of such low-cost, sensor-rich, powerful computational devices. The technology available in these devices, as well as those devices that can be further added into the chain, mean a wide range of contextual information can potentially be extracted from a game player, be they mobile or static. We consider that the foremost sources of contextual information come from the user themselves and from the surrounding environment in which the user is currently immersed. 
This is further ratified by Reynolds, Barry, Burke, and Coyle (2007) who also consider the importance and usage of contextual input parameters from these two domains in their own research.
Information from the user is arguably the most useful data that can be acquired in determining the user's context. It allows the researcher to begin investigating factors such as the user's level of activity, stress levels, and emotional state. We propose that measurements such as skin conductivity, motion, and heart rate, which might be acquired directly from the user, would prove particularly useful in monitoring their contextual state. Factors in the environment around the user are likely to have an effect on their performance in a game, their general attitudes, and their emotional state. A number of metrics can provide suitable input to a software system to estimate environmental context: the amount of ambient noise, light levels, time of day and year, temperature, and so forth. We feel that the devices and information in this scenario are relevant to many contextual extraction applications, not only those of digital entertainment and games. Hopefully, the reader can begin to gain an insight into the usefulness of contextual information from the examples and discussions in this section of the chapter. The next section of this chapter seeks to exemplify how context (as well as emotion and content) can be employed in digital multimedia applications, especially those that relate to sound. We feel that, in computer games, the virtual gameplay environment can be tailored to reflect the real environment of the player. This should provide a deeper, more immersive experience and help the player to develop greater emotional and personal investment in the game. It will also be interesting to see whether such a game can contribute to altering the emotional state of the user and impact upon their own personal context. For example, can games be designed that would relax a user, reduce their stress levels and heart rate, and even make them alter their surrounding environment to reflect their new, calmer state? Only through more contextual awareness and pervasive interactions with games will we know the answer to this question and others.
DETERMINING USER PERCEPTIONS OF MUSIC
In this section of the chapter, we aim to gain more insight into how human listeners attach emotion, content and context to music. By investigating the various perceptions and semantic terms users relate to different musical genres, it is possible to gain a deeper understanding of the ways in which humans relate their emotions, musical content and the context of different types of music. A wide range of semantics is frequently employed to portray musical characteristics, from technically-related terms to experiential narratives (Károlyi, 1999). It is proposed that the characteristics of a piece of music are difficult to quantify in a single term or statement. Whilst high-level abstractions that categorise the music or provide an overview of the timbre may be possible, they are a highly subjective and individual (and potentially emotionally influenced) expression of a listener's experience of the music. Such an investigation also allows terms to be grouped, understanding to be formed, and the relationships between these groups to be mapped.
Repertory Grid Technique
In order to extract the common descriptive features and semantics that are most meaningful and globally understood, it is better to employ a technique in which the subject may use their own descriptions of the elements under investigation. George Kelly's work (1955) on personal construct theory (PCT) and personal construct psychology (PCP) provides a suitable mechanism, known as repertory grid analysis, by which such descriptions can be elicited from subjects, correlated, and employed in measuring subject experiences. Kelly's work in this area is grounded in principles of constructivism, where subjects identify and deal with the world
around them based upon their own experiences and interpretation of events and objects. Repertory grid analysis consists of defining a particular subject or domain to be investigated within a particular context. Instances or examples of the domain are known as elements, and bi-polar descriptions of the elements, known as constructs, are rated on a scale (usually 5 or 7 point). For example, the domain being investigated might be movies, the elements could be a number of popular movies, and the bi-polar constructs used to describe and rate the movie elements could be violent or non-violent, an adult's film or a children's film, and so on. Constructs are defined by the subject with the help of the interviewer, who enables the subject to produce more constructs by defining the relationships and differences between the nominated elements. This can be enhanced through interview techniques involving triads, where three random elements are chosen and the subject is asked to choose the least similar of the three and define the construct that separates it from the other two elements (Bannister & Mair, 1968). Subjects then provide a rating on the point scale for each element against each of the constructs they have defined in order to complete their grid. Alternatively, and particularly when subjects struggle to separate two elements from a third, the interviewer can present a subject with two elements and ask them to explain the factor that differentiates the two. This will often provide one pole of a construct, and the subject is then asked what the opposite pole of that construct would be. Once the desired number of subjects have each completed a repertory grid, the grids are concatenated and can be immediately visualised as one large grid but, more crucially, the opportunity is available to determine the importance of elements and constructs within the larger grid. The bi-polar nature of constructs allows the context and relationship of a construct to be articulated and better understood by the researcher. It also removes ambiguity when a subject provides a rating, since the interrelation between the opposing ends of the scale has been specified by the subject themselves (Kelly, 1955). Further detail of PCP and repertory grid technique goes well beyond this work and can be found in Kelly's seminal text.
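The grid structure described above can be sketched in a few lines of Python. This is only an illustrative data structure, not the tooling used in the study: the class names, the `concatenate` helper, the film names and the ratings are all hypothetical, and only the two constructs are taken from the movie example in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Construct:
    """One bi-polar construct with a rating for every element."""
    left_pole: str
    right_pole: str
    ratings: Dict[str, int]  # element name -> rating on the point scale


@dataclass
class RepertoryGrid:
    """A grid for one subject, or for several subjects after concatenation."""
    elements: List[str]
    scale_max: int = 5  # 5-point scale assumed; 7 is the other common choice
    constructs: List[Construct] = field(default_factory=list)

    def add_construct(self, left_pole, right_pole, ratings):
        # Every element must be rated, and every rating must lie on the scale.
        if set(ratings) != set(self.elements):
            raise ValueError("each element needs exactly one rating")
        if any(not 1 <= r <= self.scale_max for r in ratings.values()):
            raise ValueError("rating outside the point scale")
        self.constructs.append(Construct(left_pole, right_pole, dict(ratings)))

    def as_rows(self):
        """Return the grid as rows of ratings, one row per construct."""
        return [[c.ratings[e] for e in self.elements] for c in self.constructs]


def concatenate(grids):
    """Stack the constructs of several subjects over a shared element set."""
    first = grids[0]
    combined = RepertoryGrid(list(first.elements), first.scale_max)
    for grid in grids:
        if grid.elements != first.elements or grid.scale_max != first.scale_max:
            raise ValueError("grids must share elements and scale")
        combined.constructs.extend(grid.constructs)
    return combined


# Hypothetical elements and ratings, for illustration only.
movies = ["Film A", "Film B", "Film C"]
subject_1 = RepertoryGrid(movies)
subject_1.add_construct("violent", "non-violent",
                        {"Film A": 1, "Film B": 4, "Film C": 5})
subject_2 = RepertoryGrid(movies)
subject_2.add_construct("an adult's film", "a children's film",
                        {"Film A": 2, "Film B": 5, "Film C": 3})
combined = concatenate([subject_1, subject_2])
```

Keeping both poles with every row of ratings mirrors the point made above: a rating is only interpretable relative to the two ends of the scale that the subject supplied.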
Using a Repertory Grid to Understand Perceptions of Music
A set of elements was defined to include a representative spectrum of musical genres, for which the subjects would be asked to define their own bi-polar personal constructs describing their experiences and perceptions of the characteristics of those genres when listening to music. Whilst there are many sub-genres and pseudo-related musical styles, this set provides an appropriate, broad spread for the purposes of this particular investigation without making the interview process overly laborious for the participant in terms of time and effort. The elements defined were those shown in Table 3. Subjects for the investigation were drawn from a random sample of the population. Subjects were interviewed on an individual basis and told that the purpose of the exercise was to get them to express their perceptions about the characterising features of different types of music. To carry out
Table 3. Musical Genre Elements Used in Repertory Grid Experiment
• Pop
• Rock
• Dance
• Jazz
• Classical
• Soul
• Blues
• Rap
• Country
the rating of elements against their defined constructs, subjects were asked to perform a card sorting exercise for each pair of constructs. Triads were used to elicit constructs by randomly selecting three elements; once subjects began to struggle with triads, they were asked to differentiate between two randomly selected elements instead. A total of 10 subjects were selected to participate in the elicitation process of the repertory grid interview. The age of subjects interviewed ranged from 16 to 59, with an average age of 34, and there was a 50/50 male/female gender split. The results of the ten repertory grid interviews are presented in Figure 5 and Figure 6. Though the number of subjects involved in the repertory grid interviews appears at first glance to be a small sample, the granularity of these interviews comes from the total number of constructs elicited across all participants. Furthermore, the data retrieved using constructs provide both qualitative and quantitative information regarding the domain of enquiry. In addition to the visual analysis of a repertory grid, a PrinCom map, which makes use of Principal Component Analysis (PCA), can be derived that relates elements and constructs graphically, such that the visual distance between elements and constructs is significant. The PrinCom mapping integrates both elements and constructs on a visual grid and shows the relationship between the two. A PrinGrid for the repertory grid derived in this investigation is shown in Figure 7. It is the constructs elicited that are of particular interest within the scope of this work, since the constructs used by subjects provide insight into how they perceive music.
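To make the PCA step more concrete, the following is a minimal sketch of how a grid of ratings could be projected onto its first two principal components. It is not the PrinCom or PrinGrid implementation: the function names are hypothetical, the ratings in `example_rows` are invented for illustration, and the eigen-decomposition uses plain Jacobi rotations so that the example has no dependencies beyond the standard library.

```python
import math


def covariance(rows):
    """Centre each construct (row) and return (centred rows, covariance matrix)."""
    n = len(rows[0])
    centred = [[v - sum(r) / n for v in r] for r in rows]
    cov = [[sum(a * b for a, b in zip(r1, r2)) / (n - 1) for r2 in centred]
           for r1 in centred]
    return centred, cov


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def jacobi_eigen(sym, tol=1e-12, max_rotations=10_000):
    """Eigen-decompose a symmetric matrix (size >= 2) with Jacobi rotations.

    Returns (eigenvalue, eigenvector) pairs, largest eigenvalue first.
    """
    n = len(sym)
    a = [row[:] for row in sym]
    vecs = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(max_rotations):
        # Use the largest off-diagonal entry as the pivot.
        p, q = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: abs(a[ij[0]][ij[1]]))
        if abs(a[p][q]) < tol:
            break
        # Rotation angle that zeroes the pivot entry.
        theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
        c, s = math.cos(theta), math.sin(theta)
        rot = [[float(i == j) for j in range(n)] for i in range(n)]
        rot[p][p] = rot[q][q] = c
        rot[p][q], rot[q][p] = s, -s
        rot_t = [list(col) for col in zip(*rot)]
        a = matmul(matmul(rot_t, a), rot)
        vecs = matmul(vecs, rot)
    pairs = [(a[i][i], [vecs[r][i] for r in range(n)]) for i in range(n)]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


def princom_coordinates(rows, components=2):
    """Project each element (column) onto the leading principal components."""
    centred, cov = covariance(rows)
    pairs = jacobi_eigen(cov)[:components]
    columns = list(zip(*centred))  # one column of ratings per element
    return [[sum(v * w for v, w in zip(col, vec)) for _, vec in pairs]
            for col in columns]


# Invented ratings: three constructs (rows) rated for four genres (columns).
example_rows = [[1, 2, 4, 5], [2, 1, 5, 4], [5, 4, 2, 1]]
example_coords = princom_coordinates(example_rows)
```

Elements whose coordinates lie close together were rated similarly across the constructs, which is the sense in which visual distance on such a map is meaningful. The components of each eigenvector play the corresponding role for the constructs.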
As can be seen from the grid in Figure 5 and Figure 6, the range of constructs elicited provides an insight not only into how subjects typically perceive the sound content of each musical genre but also into the context in which they place each genre, with occasional indications of the emotional impact of each genre. For example, by also
looking at Figure 7 we can produce the notion that blues music is "emotionally evocative", has "specific geography & history", is placed in the context of being "African American", and is "mellow". Naturally, there is some subjectivity present here and these statements are open to interpretation, but it is expected that, to most readers, these constructs should represent the group norm. One limitation of the repertory grid technique encountered during this particular study is the subjects' varying familiarity with the elements under investigation. During interviews there were clearly some elements that subjects were not as familiar with as others. It was observed that subjects would often group together the elements they were less familiar with when rating elements against their chosen constructs. Whilst it is appreciated that this phenomenon is likely to be particularly present in this study, since music awareness depends firmly upon personal preference or taste, it is likely to occur in other scenarios too. Using a repertory grid sought to elicit human-friendly descriptions of musical characteristics. Although not strictly timbral definitions, these constructs succeed in describing the characteristics of musical genres. To put this into the context of artistic definitions with the notion of a visual metaphor: whilst timbre is a human description of the colour of a sound or piece of music, these constructs can be thought of as describing the patterns, that is, the mix of shapes and colours that provides deeper information about the content and the bigger picture. We find repertory grid investigation to be a highly useful tool in determining group norms and perceptions of important factors in any field that is being explored.
In the context of this chapter, it can hopefully be seen that such techniques would allow information to be gathered about how a group of users perceives a game and its sound, in terms of the content that constitutes the game, their emotional perceptions of the game, and the context in which they view it.
Figure 5. Repertory Grid Ratings of Musical Genres (Part a).
EXAMPLE APPLICATIONS IN CURRENT RESEARCH
Presented in this section are summaries of research work that has been influenced by, or has involved, the use of emotion, content and context in various guises. A number of the works presented here are studies involving a small set of music. For convenience, this small database of music is shown in Table 4 so that the reader may
Figure 6. Repertory Grid Ratings of Musical Genres (Part b).
Figure 7. Musical Genre Principal Components Analysis.
cross-reference the ID number to the song, where appropriate. We feel that this small selection of songs represents a reasonable cross-section of contemporary popular musical genres.
Responsive Automated Music Playlists
Some of our most recent and cognate work combining the use of emotion, content and context in musical applications has been in the area of intelligent playlist generation tools, and is explored in greater detail in a separate work (Cunningham, Caulder, & Grout, 2008). However, to show the effectiveness of combining all three of these areas, a summary of that work to date is provided here.
Our main motivation in this area of research and development was to address some of the shortcomings of the approaches traditionally employed in automatic recommendation and playlist generation tools. Historically, these tools evolved in a similar way to Automated Collaborative Filters (ACFs). That is to say, simple measurements of user preference and the preference of a typical population were used to build a ranked table of music in a library. These tools analysed information such as the most frequently played tracks, a user rating of each track, favourite artists and musical genres, and other meta-data attached to a song (Cunningham, Bergen, & Grout, 2006). However, this is not to trivialise the area of automatic playlist generation, since a number of systems exist that employ much more advanced learning and
Table 4. Mini Music Database Used in Testing
ID  Artist                 Song
0   Daft Punk              One More Time (Radio Edit)
1   Fun Lovin' Criminals   Love Unlimited
2   Hot Chip               Over and Over
3   Metallica              Harvester of Sorrow
4   Pink Floyd             Comfortably Numb
5   Sugababes              Push The Button
6   The Prodigy            Breathe
7   ZZ Top                 Gimme All Your Lovin'
analysis techniques and technologies (Aucouturier & Pachet, 2002; Platt, Burges, Swenson, Weare, & Zheng, 2002). In recent years, as computational power and resources have increased, the tools that underpin musical and sound content analysis have migrated deeper into the field of playlist generation, allowing greater scope and accuracy for classification of musical features (Logan, 2002; Gasser, Pampalk, & Tomitsch, 2007). However, although these advances have been significant, such methods have focused on musical and sonic information extraction, and few systems have considered the wider scope of the user and his or her environment. It is reasonable to expect that factors relating to the emotion, state of mind and current activities of a listener will greatly influence their current and, most importantly, next selection of music.
Emotion, Content and Context in Automatic Playlist Generation
In this review of our recent work in the area of intelligent automatic playlist generation, we provide details of the development principles, investigation and analysis into the viability of playlist generation that considers the wider circumstance of the listener. To achieve more accurate and useful playlist generation, we propose to not only build upon the established principles of using information about the musical content but also
to examine the context the listener is in and how these external factors might affect their choice of suitable music. Additionally, the emotional state of the user is of interest, since this is also likely to influence their choice of music. An idealised scenario summarising the complete system we are describing here is shown in Figure 8. To realise a system that will potentially have to deal with and correlate a wide range of input parameters, we employ approaches that utilise fuzzy logic and self-learning systems. Determining the user's emotional state is of key importance to a truly successful and useful playlist generation system. This estimate is informed by and correlated with the state of the environmental factors surrounding the user, as well as their current levels of movement, physical activity, heart rate, stress levels, and so on. Within our work, we felt that it was initially most important to investigate the current locomotive state of the user. For example, a user who is moving a lot and accelerating rapidly may be participating in an energetic activity such as running, dancing, cycling, or exercising in some way. It is feasible to suggest that most people listening to music in such scenarios would want music that reflects the nature of their physical motion, such as music with a dominant, driving rhythm and a high tempo, greater than 120 or 130 beats per minute, for instance.
Figure 8. Emotion, Content and Context Aware Playlist System
Given currently available technology, it is also relatively easy to find equipment that allows the measurement of a user's movement. This was realised in our case by employing the wireless hand controller from the Nintendo Wii: the Wiimote. The Wiimote, when compared to other alternatives, is a cheap device that allows measurement of movement in three dimensions. It is also widely accessible, since it employs the Bluetooth communication protocol to send and receive data to and from a paired host. As Maurizio and Samuele (2007) demonstrate, valuable motion information can be retrieved through the accelerometers contained in the Wii controller.
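As a rough illustration of how three-axis accelerometer readings might be turned into a locomotive state such as standing, walking, jogging or running, consider the sketch below. It assumes samples have already been read from the controller as (x, y, z) values in units of g; the threshold values and function names are hypothetical placeholders, not figures measured in our study.

```python
import math

# Hypothetical upper bounds on mean dynamic acceleration (in g) for each state.
# These are placeholders for illustration, not values from the experiments.
STATE_THRESHOLDS = [(0.05, "standing"), (0.3, "walking"), (0.8, "jogging")]


def dynamic_magnitude(sample, gravity=1.0):
    """Magnitude of one (x, y, z) sample with the static 1 g of gravity removed."""
    x, y, z = sample
    return abs(math.sqrt(x * x + y * y + z * z) - gravity)


def locomotive_state(samples):
    """Classify a window of samples by their mean dynamic magnitude."""
    mean = sum(dynamic_magnitude(s) for s in samples) / len(samples)
    for upper_bound, state in STATE_THRESHOLDS:
        if mean < upper_bound:
            return state
    return "running"
```

In practice the thresholds would need to be calibrated per user and per device, and a windowed average is only one of several possible features.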
Implementation and Initial Results
Our initial work in this field sought to demonstrate the ability to attain, analyse and correlate content-related data about music and contextual information acquired from the user to arrive at an estimate of the user's emotional state (E-state). To achieve this, we developed a small-scale system that would work from a number of simulated factors (controlled by the researcher) and also live data extracted from sensors, principally the Wii controller. This system was designed to work with a small music database consisting of eight songs, shown in Table 4, and rank these songs in order of suitability, based upon the estimated E-state. To begin working with the motion data from the Wiimote controller, we attempted to work with four simple locomotive states: standing, walking, jogging, and running. These simple locomotive states were believed to be detectable not only from the Wiimote but also from a range of motion measurement devices such as the accelerometers built into the Apple iPhone/iPod, as well as higher-level systems such as a Qualisys motion capture system (which we also had access to and which allowed us to verify the
results obtained from the Wiimote). Similarly, for each of the other input parameters we work with, such as weather conditions, amount of light and so on, a range of states was also defined. As previously mentioned, due to our current limitations of time and resources, we focused only on the locomotive state of the user. It is hoped that, in future, further ratification will be achieved by using other user measurements such as heart rate and galvanic skin response. To avoid additional complication, and to allow greater control over the testing procedure, this set of parameter ranges and values was loosely defined according to the empirical and historic knowledge of the individual. However, this is too fixed and logical to fit the way things tend to work in the real world; therefore, to make these values more representative of real scenarios, they are fuzzified when defined in the fuzzy logic system. In simple terms, this means that each point on a scale belongs to a state to a certain degree, rather than falling on one side of a hard boundary, so that neighbouring states can overlap. The implemented fuzzy logic system provides a single output value that represents the predicted E-state of the listener. For simplification, we began by defining five emotional states and assigned a numeric range to each state to allow the representation of varying degrees of this state and the overlap where states merge into one another. This is appropriate, since it is very difficult to place an absolute, quantitative metric onto the complex emotions felt by humans. These allocations are shown in Table 5.

Table 5. Defined Set of Emotional States (E-states)
E-State         Depressed   Unhappy   Neutral   Happy   Zoned
Numeric Range   0-3         3-4       4-6       5-8     7-9

To verify the effectiveness of using the Wii controller as a device to measure movement, and particularly locomotive state, we performed a number of experiments benchmarked against a
Qualisys motion detection system. By determining the rate of acceleration from the accelerometers in the Wii controller and asking a subject to perform three locomotive states (walking, jogging, and running), we can see that the Wii controller allows rapid identification of each of these states, as the graph in Figure 9 shows. We defined a number of different scenarios that a user might typically find himself or herself in, each combining a range of contextual parameters. These scenarios are shown in Table 6. The parameters included those that the user has control over, in this case the locomotive state, and external, environmental factors beyond the control of the user. We carried out a quantitative investigation, using 10 subjects, where each subject was asked to map one of the emotional states from Table 5 against each of the scenarios from Table 6. The average E-state response rating for each scenario from this investigation is shown in Table 7. Although the use of an average response is not ideal, it provides sufficient insight into the common perception of each scenario, and when we analysed each response there was strong correlation across all of the subjects. Each of the songs in the mini-database was allocated a range of values from the E-state table by the research team. These values reflected the researchers' perceptions of the content and emotional indicators present in the music. Naturally, in future research, we will explore the perceived emotional state attached to each song by employing a more detailed sample of a suitable population. For now, however, these allocations were treated as a controlled factor in this particular investigation. These allocations were then mapped
Figure 9. Wii Acceleration Curves for 3 States of Locomotion
against the emotional states extracted from the user-scenario study and each song's E-state was ranked against each scenario's E-state to provide a grade for each song, using a simple Euclidean distance measurement of the form

$$G(p, q) = \sqrt{(p - q)^{2}}. \tag{1}$$
Table 8 shows the resultant ranking of songs (by their ID), for each of these scenarios.
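The grading and ranking step can be sketched as follows. The scenario E-state values would come from Table 7, but the per-song E-state values below are hypothetical single numbers chosen for illustration (the study allocated a range to each song, and those allocations are not reproduced here), so the resulting order should not be expected to match Table 8 in general.

```python
import math

# Hypothetical E-state value for each song ID in Table 4 (illustration only).
SONG_E_STATE = {0: 6.5, 1: 4.0, 2: 8.0, 3: 1.5, 4: 4.0, 5: 8.0, 6: 1.5, 7: 6.5}


def grade(p, q):
    """Equation (1): one-dimensional Euclidean distance between two E-states."""
    return math.sqrt((p - q) ** 2)


def rank_songs(scenario_e_state, song_e_states=SONG_E_STATE):
    """Order song IDs from most to least suitable (smallest grade first).

    Ties are broken by song ID, which is an assumption of this sketch.
    """
    return sorted(
        song_e_states,
        key=lambda song_id: (grade(song_e_states[song_id], scenario_e_state), song_id),
    )


# Example: rank the songs for a scenario with an average E-state of 4.3.
playlist = rank_songs(4.3)
```

In one dimension the grade reduces to the absolute difference between the two values; the square-root form is kept so that the same function extends naturally if further dimensions are added.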
At this stage, it is recognised that the system is more limited than the idealised scenario presented earlier in Figure 8. A number of factors have been simulated and a number of assumptions have been made. However, using a Fuzzy Rule Based System (FRBS) we have been able to implement a limited version of that system, which dynamically outputs an E-state based on live sensor data from a Wii controller and simulated environmental parameters. An outline of the Takagi-Sugeno-Kang (TSK) type
Table 6. Range of Scenarios in Subject E-state Evaluation
ID  Scenario Description
1   Walking, temperature is hot, lighting is dark/grey and weather is light rain.
2   Stationary, temperature is cold, lighting is dark and weather is raining.
3   Stationary, temperature is warm, lighting is brightening and weather is dry.
4   Running, temperature is hot, lighting is daylight/getting brighter and dry.
5   Walking, temperature is getting hot, lighting is dark and weather is drizzling.
6   Stationary, temperature is hot, lighting is grey and weather is dry.
7   Walking/Jogging, temperature is mild, daylight and it's dry.
Table 7. Average User E-state for each Scenario
Scenario ID   Emotion (0 = Unhappy; 100 = Very Happy)
1             4.3
2             0.0
3             6.8
4             7.7
5             3.0
6             3.8
7             6.5
FRBS we employed, along with the fuzzified input parameters, is shown in Figure 10. Whilst a number of parameters such as light and temperature measurement have been simulated at this stage, the implementation of live sensor data from such devices is a trivial one and is only limited by the current lack of the hardware resources to incorporate these devices into the live system. In its current state the system demonstrates the ability to read contextual data from the user and correlate this with information from the environment to make an informed judgement of the user’s emotional state. With future development and also feedback from the user whilst using the system, the facility will be available to teach the playlist generator about the user’s preferences and train the accuracy with which the system is able to estimate the emotional state of the user.
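A zero-order TSK-type rule base of the general kind described above can be sketched as follows. This is not the rule base shown in Figure 10: the two inputs (a normalised activity level and a normalised light level), the membership functions, the rules and their constant outputs are all hypothetical. Only the E-state ranges used for labelling are taken from Table 5.

```python
def triangle(x, left, peak, right):
    """Triangular membership function returning a degree in [0, 1]."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical fuzzy sets over inputs normalised to the range 0..1.
ACTIVITY = {"low": (-0.5, 0.0, 0.5), "medium": (0.0, 0.5, 1.0), "high": (0.5, 1.0, 1.5)}
LIGHT = {"dark": (-1.0, 0.0, 1.0), "bright": (0.0, 1.0, 2.0)}

# Hypothetical rules: (activity set, light set) -> constant E-state output.
RULES = {
    ("high", "bright"): 8.0, ("high", "dark"): 6.0,
    ("medium", "bright"): 6.5, ("medium", "dark"): 4.0,
    ("low", "bright"): 5.0, ("low", "dark"): 1.5,
}

# E-state ranges from Table 5; note that neighbouring ranges overlap.
E_STATE_RANGES = [("Depressed", 0, 3), ("Unhappy", 3, 4), ("Neutral", 4, 6),
                  ("Happy", 5, 8), ("Zoned", 7, 9)]


def estimate_e_state(activity, light):
    """Weighted average of rule outputs, weighted by rule firing strength."""
    weighted_sum = total_weight = 0.0
    for (activity_set, light_set), output in RULES.items():
        # Product is used for AND here; minimum is another common choice.
        weight = (triangle(activity, *ACTIVITY[activity_set])
                  * triangle(light, *LIGHT[light_set]))
        weighted_sum += weight * output
        total_weight += weight
    if total_weight == 0.0:
        raise ValueError("inputs fall outside every fuzzy set")
    return weighted_sum / total_weight


def e_state_labels(value):
    """All Table 5 labels whose numeric range contains the value."""
    return [name for name, low, high in E_STATE_RANGES if low <= value <= high]
```

For example, `estimate_e_state(0.5, 0.5)` fires the two "medium" rules equally, and the resulting value falls in the overlap between the Neutral and Happy ranges. A learning step of the kind mentioned above could then adjust the constant outputs from user feedback.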
FUTURE RESEARCH IDEAS AND CONCLUSIONS
We have seen that awareness of the presented issues is beneficial in not only providing richer interactive experiences and more appropriate information, but that knowledge of the purpose of information can be used to optimise computational challenges. Furthermore, information, such as that presented visually, can be analysed in terms of content and context with the goal of being able to stimulate emotions in a user who might otherwise be disadvantaged from such an experience, due to visual impairment. It is hoped that we have demonstrated the applicability of determining and analysing features related to emotion, content and context in relation to improving systems where user-interaction is of particular significance.
Table 8. Ranked Playlist Ordering for Set of Given Scenarios
Scenario   E-state   Playlist order
1          4.3       1; 4; 0; 7; 3; 6; 2; 5
2          0         6; 3; 4; 1; 0; 7; 2; 5
3          6.8       0; 7; 2; 5; 1; 4; 3; 6
4          7.7       2; 5; 0; 7; 1; 4; 3; 6
5          3         4; 3; 6; 1; 0; 7; 2; 5
6          3.8       4; 1; 3; 6; 0; 7; 2; 5
7          6.5       0; 7; 1; 2; 5; 4; 3; 6
Figure 10. Simplified Overview of TSK-type FRBS Used in Playlist Generation
For the budding researcher interested in these areas, we suggest the following broad, non-exhaustive list of some of the key research themes and areas that would greatly benefit from further investigation:
• Explore the commonly perceived emotions of users playing computer games and determine the most reliable methods with which to record and model these emotions
• Develop a common software toolbox to allow audio content analysis to be easily bolted into a range of software products
• Further examine the value of environmental context for users playing computer games compared with user-centred contextual data
• Assess gameplay parameters that are best influenced by, and reflected in, the emotion, content and context of the user
• Develop fuzzy logic systems that can accurately read a range of content and contextual data and output a robust, truly reflective emotional state for the majority of a sample user population.
Above all, we hope that this chapter has stimulated intellectual thought and got the creative juices flowing. Our aim here is not to provide a cast-in-stone set of data and instructions for the budding developer and researcher to follow and obey: far from it! By all means question, criticise and make up your own mind. Anyone working in the field of computer game and multimedia development that involves sound will not only be aware of the technical, computing, and engineering issues of their field (the logical ones) but will doubtless also have opinions and their own tastes and creativity. If there is a lesson to be learned from this chapter, it is that we hope you will consider the bigger picture, the wider implications, the external factors and the notion of a user-centred design process. We feel that the three areas of emotion, content and context
epitomise these views and will be the crucial issues in future technological and entertainment areas. Think big. In our opinion, the ‘blue sky’ and ‘off the wall’ ideas are some of the most fun and interesting things you can do when it comes to being creative with technology. Have fun with your work and work with fun stuff!
Cunningham, S., Grout, V., & Hebblewhite, R. (2006). Computer game audio: The unappreciated scholar of the Half-Life generation. In Proceedings of the Audio Mostly Conference on Sound in Games.
rEFErENcEs
Davies, G., Cunningham, S., & Grout, V. (2007). Visual stimulus for aural pleasure. In Proceedings of the Audio Mostly Conference on Interaction with Sound.
Aucouturier, J. J., & Pachet, F. (2002). Scaling up music playlist generation. In Proceedings of the IEEE International Conference on Multimedia Expo.
Dance dance revolution. (1998). Konami. Dave mirra freestyle BMX. (2000). Z-Axis.
Davis, H., & Silverman, R. (1978). Hearing and deafness (4th ed.). Location: Thomson Learning.
Bannister, D., & Mair, J. M. M. (1968). The evaluation of personal constructs. London: Academic Press.
Dix, A., Finlay, J., Abowd, G. D., & Beale, R. (2003). Human computer interaction (3rd ed.). Essex, England: Prentice Hall.
Battle of the bands. (2008). Planet Moon Studios.
Ekman, I. (2008). Psychologically motivated techniques for emotional sound in computer games. In Proceedings of the Audio Mostly Conference on Interaction with Sound.
BioShock. (2007). Irrational Games. Bordwell, D., & Thompson, K. (2004). Film art: An introduction (7th ed.). New York: McGrawHill. Clarkson, B., Mase, K., & Pentland, A. (2000). Recognizing user context via wearable sensors. In Proceedings of the Fourth International Symposium of Wearable Computers. Conati, C. (2002). Probabilistic assessment of user’s emotions in educational games. Applied Artificial Intelligence, 16(7/8), 555–575. doi:10.1080/08839510290030390 Cunningham, S., Bergen, H., & Grout, V. (2006). A note on content-based collaborative filtering of music. In Proceedings of IADIS - International Conference on WWW/Internet. Cunningham, S., Caulder, S., & Grout, V. (2008). Saturday night or fever? Context aware music playlists. In Proceedings of the Audio Mostly Conference on Interaction with Sound.
260
F.E.A.R. First encounter assault recon. (2005). Monolith Productions. FIFA. (1993-). EA Sports Foote, J. (1999). Visualizing music and audio using self-similarity. Proceedings of the seventh ACM international conference on Multimedia (Part 1), 77-80. Freeman, D. (2004). Creating emotion in games: The craft and art of emotioneering™. Computers in Entertainment, 2(3), 15. doi:10.1145/1027154.1027179 Gasser, M., Pampalk, E., & Tomitsch, M. (2007). A content-based user-feedback driven playlist generator and its evaluation in a real-world scenario. In Proceedings of the Audio Mostly Conference on Interaction with Sound. Grand theft auto. (1993-). Rockstar Games.
Emotion, Content, and Context in Sound and Music
Grimshaw, M., Lindley, C. A., & Nacke, L. (2008). Sound and immersion in the first-person shooter: Mixed measurement of the player’s sonic experience. In Proceedings of the Audio Mostly Conference on Interaction with Sound. Guitar hero. (2005-). [Computer software]. Harmonix Music Systems (2005- 2007)/ Neversoft (2007-). Hitchcock, A. (Director) (1960). Psycho. Hollywood, CA: Paramount. Jansz, J. (2006). The emotional appeal of violent video games. Communication Theory, 15(3), 219– 241. doi:10.1111/j.1468-2885.2005.tb00334.x Johnstone, T. (1996). Emotional speech elicited using computer games. In Proceedings of Fourth International Conference on Spoken Language (ICSLP96). Jørgensen, K. (2011). Time for new terminology? Diegetic and non-diegetic sounds in computer games revisited . In Grimshaw, M. (Ed.), Game Sound Technology and Player Interaction: Concepts and Developments. Hershey, PA: IGI Global. Károlyi, O. (1999). Introducing music. Location: Penguin. Kelly, G. A. (1955). The psychology of personal constructs. New York: Norton. Kromand, D. (2008). Sound and the diegesis in survival-horror games. In Proceedings of the Audio Mostly Conference on Interaction with Sound. Livingstone, S. R., & Brown, A. R. (2005). Dynamic response: Real-time adaptation for music emotion. In Proceedings of the Second Australasian Conference on Interactive Entertainment. Logan, B. (2002). Content-based playlist generation: Exploratory experiments, In ISMIR2002, 3rd International Conference on Musical Information (ISMIR).
Maurizio, V., & Samuele, S. (2007). Lowcost accelerometers for physics experiments. European Journal of Physics, 28, 781–787. doi:10.1088/0143-0807/28/5/001 Moore, B. C. J. (Ed.). (1995). Hearing: Handbook of perception and cognition (2nd ed.). New York: Academic Press. Moore, B. C. J. (2003). An introduction to the psychology of hearing (5th ed.). New York: Academic Press. Nacke, L., & Grimshaw, M. (2011). Player-game interaction through affective sound . In Grimshaw, M. (Ed.), Game Sound Technology and Player Interaction: Concepts and Developments. Hershey, PA: IGI Global. Parker, J. R. & Heerema, J. (2008). Audio interaction in computer mediated games. International Journal of Computer Games Technology. Platt, J. C., Burges, C. J. C., Swenson, S., Weare, C., & Zheng, A. (2002). Learning a Gaussian process prior for automatically generating music playlists. Advances in Neural Information Processing Systems, 14, 1425–1432. Pong. (1972). Atari Inc. Ravaja, N., Saari, T., Laarni, J., Kallinen, K., Salminen, M., Holopainen, J., & Järvinen, A. (2005). The psychophysiology of video gaming: Phasic emotional responses to game events. In Proceedings of DiGRA 2005 Conference: Changing Views - Worlds in Play. Reynolds, G., Barry, D., Burke, T., & Coyle, E. (2007). Towards a personal automatic music playlist generation algorithm: The need for contextual information. In Proceedings of the Audio Mostly Conference on Interaction with Sound. Rock band. (2005-2007). Harmonix Music Systems.
Emotion, Content, and Context in Sound and Music
Schmidt, A., & Winterhalter, C. (2004). User context aware delivery of e-learning material: Approach and architecture. Journal of Universal Computer Science, 10(1), 38–46.
Silent hill 2. (2001). Konami.
SingStar. (2004). London Studio.
Space invaders. (1978). Taito Corporation.
Stigwood, R., & Badham, J. (Producers). (1977). Saturday night fever [Motion picture]. Hollywood, CA: Paramount.
Sykes, J., & Brown, S. (2003). Affective gaming: Measuring emotion through the gamepad. In Proceedings of Conference on Human Factors in Computing Systems (CHI ‘03).
Tamminen, S., Oulasvirta, A., Toiskallio, K., & Kankainen, A. (2004). Understanding mobile contexts. Personal and Ubiquitous Computing, 8(2), 135–143. doi:10.1007/s00779-004-0263-1
Ultimate band. (2008). Fall Line Studios.
Wii Music. (2008). Nintendo.
Yost, W. A. (2007). Fundamentals of hearing: An introduction (5th ed.). New York: Academic Press.
Zhang, T., & Jay Kuo, C. C. (2001). Audio content analysis for online audiovisual data segmentation and classification. IEEE Transactions on Speech and Audio Processing, 9(4), 441–457. doi:10.1109/89.917689
ADDITIONAL READING
Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.
Brewster, S. A. (2008). Nonspeech auditory output. In Sears, A., & Jacko, J. (Eds.), The human computer interaction handbook (2nd ed., pp. 247–264). Philadelphia: Lawrence Erlbaum Associates.
Cunningham, S., & Grout, V. (2009). Audio compression exploiting repetition (ACER): Challenges and solutions. In Proceedings of the Third International Conference of Internet Technologies and Applications (ITA 09).
Ekman, I., & Lankoski, P. (2009). Hair-raising entertainment: Emotions, sound, and structure in Silent Hill 2 and Fatal Frame. In Perron, B. (Ed.), Gaming after dark. Welcome to the world of horror video games (pp. 181–199). Jefferson, NC: McFarland & Company, Inc.
Freeman, D. (2003). Creating emotion in games. Indianapolis, IN: New Riders.
Grimshaw, M. (2008). The acoustic ecology of the first-person shooter: The player, sound and immersion in the first-person shooter computer game. Saarbrücken, Germany: VDM Verlag Dr. Mueller.
Grimshaw, M. (2009). The audio Uncanny Valley: Sound, fear and the horror game. In Proceedings of the Audio Mostly Conference on Interaction with Sound.
Loy, G. (2006). Musimathics: The mathematical foundations of music (Vol. 1). Cambridge, MA: MIT Press.
Loy, G. (2007). Musimathics: The mathematical foundations of music (Vol. 2). Cambridge, MA: MIT Press.
Papworth, N., Liljedahl, M., & Lindberg, S. (2007). Beowulf: A game experience built on sound effects. In Proceedings of the 13th International Conference on Auditory Display (ICAD).
Röber, N., & Masuch, M. (2005). Leaving the screen: New perspectives in audio-only gaming. In Proceedings of 11th International Conference on Auditory Display (ICAD).
KEY TERMS AND DEFINITIONS
Content: The definable qualities and characteristics of any given piece of information.
Context: The scenario and environment in which a user or application is placed.
Emotional Interaction: A digital system's capability of inducing emotional reactions in a user and of responding dynamically to human emotional states.
Emotional Reaction: A human affective response or feeling in response to one or more stimuli.
Emotional State: The dominant, overriding emotional sensation of a human at a given moment.
Playlist Generation: The production of a sequence of songs, often to be listened to on a portable music player.
Chapter 13
Player-Game Interaction Through Affective Sound
Lennart E. Nacke, University of Saskatchewan, Canada
Mark Grimshaw, University of Bolton, UK
ABSTRACT
This chapter treats computer game playing as an affective activity, largely guided by the audio-visual aesthetics of game content (of which, here, we concentrate on the role of sound) and the pleasure of gameplay. To understand the aesthetic impact of game sound on player experience, definitions of emotions are briefly discussed and framed in the game context. This leads to an introduction of empirical methods for assessing physiological and psychological effects of play, such as the affective impact of sonic player-game interaction. The psychological methodology presented is largely based on subjective interpretation of experience, while psychophysiological methodology is based on measurable bodily changes, such as context-dependent, physiological experience. As a means to illustrate both the potential and the difficulties inherent in such methodology we discuss the results of some experiments that investigate game sound and music effects and, finally, we close with a discussion of possible research directions based on a speculative assessment of the future of player-game interaction through affective sound.
DOI: 10.4018/978-1-61692-828-5.ch013
Copyright © 2011, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
INTRODUCTION
Digital games have grown to be among the favourite leisure activities of many people around the world. Today, digital gaming competes for a share of individual leisure time with more traditional activities like reading books, watching movies, listening to music, surfing the internet, or playing sports. Games also pose new research challenges to many scientific disciplines – old and new – as they have been hailed as drivers of cloud computing and innovation in computer science (von Ahn & Dabbish, 2008), promoters of mental health (Miller & Robertson, 2009; Pulman, 2007), tools for training cognitive and motor abilities (Dorval & Pepin, 1986; Pillay, 2002), and as providers of highly immersive and emotional environments for their players (Ravaja, Turpeinen,
Saari, Puttonen, & Keltikangas-Järvinen, 2008; Ryan, Rigby, & Przybylski, 2006). Gaming is a joyful and affective activity that provides emotional experiences, and these experiences may guide how we process information. Regarding emotions, Norman’s (2004) definition is that emotion works through neurochemical transmitters which influence areas of our brain and successively guide our behaviour and modify how we perceive information and make decisions. While Norman makes a fine distinction between affect and cognition, he also suggests that both are information-processing systems with different functionality. Cognition refers to making sense of the information that we are presented with, whereas affect refers to the immediate “gut reaction” or feeling that is triggered by an object, a situation, or even a thought. Humans strive to maximize their knowledge by accumulating novel, but also interpretable, information. Experiencing novel information and being able to interpret it may be a cause of neurophysiological pleasure (Biedermann & Vessel, 2006). Cognitive processing of novel information activates endorphins in the brain, which moderate the sensation of pleasure. Thus, presenting novel cues in a game environment will affect and mediate player experience and in-game learning. This is an excellent example of how cognition and affect mutually influence each other, which is in line with modern emotion theories (Damasio, 1994; LeDoux, 1998; Norman, 2004). Norman (2004) proceeds to define emotion as consciously experienced affect, which allows us to identify who (or what) caused our affective response and why. The problem of not making a clear distinction between emotion and affect is further addressed by Bentley, Johnston, & von Baggo (2005), who recall Plutchik’s (2001) view on emotion as an accumulated feeling which is influenced by context, experience, personality, affective state, and cognitive interpretation.
They also explain that user experience for desktop software or office-based systems is more dependent on performance factors while, for digital games, user experience depends much more on affective factors. Affect is defined as a discrete, conscious, subjective feeling that contributes to, and influences, an individual’s emotion (Bentley, et al., 2005; Damasio, 1994; Russell, 1980). We will revisit this notion later in the text. In addition, Moffat (1980) introduced an interesting notion about the relationships between personality and emotion, which are distinguished along two dimensions: duration (brief and permanent) and focus (focused and global). For example, an emotion might develop from brief affection into a long-term sentiment, or a mood that occurs steadily might become a personality trait. The two dimensions can be plausibly identified at a cognitive level, making a strong case for the relation between emotion, cognition, and personality both at the surface and at a deep, structural level. Psychophysiological research shows that processing unpleasant sound cues (e.g., bomb sounds) elicits more activity in facial muscles such as corrugator supercilii (indicating negative appraisal) and higher arousal, which shows that sound cues can be used in games to influence players’ emotional reactions (Bradley & Lang, 2000). Sound and music are generally known to enhance immersion in a gaming situation (Grimshaw, 2008a). Music has also been credited with facilitating absorption in an activity (Rhodes, David, & Combs, 1988), and it is generally known to trigger the mesolimbic reward system in the human brain (Menon & Levitin, 2005), allowing music to function as a reward mechanism in game design and possibly allowing for reinforcement learning (Quilitch & Risley, 1973). The recent explosion of interactive music games is a testament to the pleasure-enhancing function of music in games. Examples of interactive music games are Audiosurf (2008), the Guitar Hero series (2005-2009), SingStar (2004), and Wii Music (2008). They make heavy use of reinforcement learning, as both positive
and negative reinforcement are combined when learning to play a song in Guitar Hero (2005), for example (for a comprehensive list of interactive music games see the list at the end of this chapter). Hitting the button and strumming with the right timing leads to positive reinforcement, in that the guitar track of the particular song is played back and suggests player finesse, while a cranking sound acts as negative reinforcement when the button and strumming are off. Such reward mechanisms that foster reinforcement learning are a very common design element in games (see Collins, Tessler, Harrigan, Dixon, & Fugelsang, 2011). Applying them to diegetic composition of music is new and warrants further study, as sound and music effects in games are currently not studied with the same scientific rigour that is present, for example, in the study of violent digital game content and aggression (Bushman & Anderson, 2002; Carnagey, Anderson, & Bushman, 2007; Ferguson, 2007; Przybylski, Ryan, & Rigby, 2009). In addition to the reinforcement learning techniques in game design, another design feature is what Bateman (2009, p. 66) calls toyplay, facilitating the motivation of playing for its own sake. Toyplay denotes an unstructured activity of play guided by the affordances of the gameworld and is largely of an exploratory nature (Bateman, 2009; Bateman & Boon, 2006), being similar to games of emergence (Juul, 2005, p. 67) and to the unstructured and uncontrolled play termed “paidia” (Caillois, 2001, p. 13). Many music games work completely without a narrative framing and derive the joy of playing simply from their player-game interaction. For example, Audiosurf (2008) eliminates most design elements not necessary for the interaction of the player with the game, which is essentially the production of music by “surfing” the right tones. The colourful representation of tones and notes is a visual aesthetic that drives the player to produce music. This simple concept is brought to stellar quality in games such as Rez (2001) or SimTunes (1996), which truly appeal to the toyplay aspect of gaming. Therefore, toyplay
elements and reinforcement learning techniques are the two design methods that are most pronounced in music interaction games and that drive affective engagement with sound and music.
With recent efforts in the field of human-computer interaction (Dix, Finlay, & Abowd, 2004), the sensing and evaluation of the cognitive and emotional state of a user during interaction with a technological system has become more important. The automatic recognition of a user’s affective state is still a major challenge in the emerging field of affective computing (Picard, 1997). Since affective processes in players have a major impact on their playing experiences, recent studies have emerged that apply principles of affective computing to gaming (Gilleade, Dix, & Allanson, 2005; Hudlicka, 2008). The field of affective gaming is concerned with processing of sensory information from players (Gilleade & Dix, 2004), adapting game content (Dekker & Champion, 2007) – for example, artificial behaviour of non-player character game agents to player emotional states – and using emotional input as a game mechanic (Kuikkaniemi & Kosunen, 2007). However, not much work has been put into sensing the emotional cues of game sound (Grimshaw, Lindley, & Nacke, 2008), let alone into understanding the impact of game sound on players’ affective responses.
We start by discussing general theories of emotion and affect and their relevance to games and psychophysiological research (for a more general introduction to emotion, see Cunningham, Grout, & Picking’s (2011) chapter, Emotion, Content, and Context in Sound and Music). For instance, we suggest it is emotion that drives attention and this has an important effect upon both engagement with the game and immersion (in those games that strive to provide immersive environments). Immersion is an important and current topic in games literature. Rather than attempt to define it (that is attempted elsewhere in this book), we limit ourselves to a brief overview of immersion theories and their relationship to theories of emotion, flow,
and presence before discussing empirical studies and theoretical stratagems for measuring player immersion as aided by game sound. Once we can understand under what sonic conditions immersion arises, we can then design more precisely for immersion.
THEORIES OF EMOTION
Psychophysiological research, affective neuroscience, and affective and emotive computing support the assumption that a user’s (or in our case a player’s) affective state can be measured by sensing brain and body responses to experienced stimuli (Nacke, 2009). Emotions in this sense can be seen as psychophysiological processes, which are evoked by sensation, perception, or interpretation of an event and/or object, which is referred to in psychology as a stimulus. A stimulus usually entails physiological changes, cognitive processing, subjective feeling, or general changes in behaviour. This is of general interest, since playing games includes all sorts of virtual events taking place in virtual environments containing virtual objects. Emotions blur the boundaries between physiological and mental states, being associated with feelings, behaviours and thoughts. No definitive taxonomy has been worked out for emotions, but several ways of classifying emotions have been used in the past. One of the first and most prominent theories of emotion is the James-Lange theory, which states that emotion follows from experiencing physiological alterations: The change of an outside stimulus (either event or object) causes the physiological change which then generates the emotional experience (James, 1884; Lange, 1912). The Cannon-Bard theory offered an alternative explanation of the processing sequence of emotions, stating that, after an emotion occurs, it evokes a certain behaviour based on the processing of the emotion (Cannon, 1927). Thus, the perception of a certain emotion is likely to influence the psychophysiological reaction. This theory already tries to account for a combination of cognitive and physiological factors when experiencing emotions, in which case an emotion is not purely physiological (i.e., it is not separate from mental processing). Another important emotional concept is the two-factor theory of emotions, which is based on empirical observations (Schachter & Singer, 1962) and considers emotions to arise from the interaction of two factors: cognitive labeling and physiological arousal (Schachter, 1964). In this theory, cognition is used as a framework within which individual feelings can be processed and labeled, giving the state of physiological arousal positive or negative values according to the situation and past experiences. These theories have spawned modern emotion research in neurology and psychophysiology (Damasio, 1994; Lang, 1995; LeDoux, 1998; Panksepp, 2004), which is gathering evidence for a strong connection between affective and cognitive processing as underlying factors of emotion, in line with the definition of Norman (2004) which we initially provided.
From Emotions to Experience
Modern emotion research typically uses one of two taxonomies, which account for emotions either as consisting of a combination of a few fundamental emotions or as comprising different dimensions, usually demarcated by extreme characteristics at the ends of the dimensional scales:
1. Emotions comprise a set of basic emotions. In the vein of Darwin (1899), who observed fundamental characteristic expressive movements, gestures, and sounds, researchers like Ekman (1992) and Plutchik (2001) argue for a set of basic discrete emotions, such as fear, anger, joy, sadness, acceptance, disgust, expectation, and surprise. Each basic emotion can be correlated to an individual physiological and behavioural reaction, for example a facial expression, as Ekman (1992; Ekman & Friesen, 1978) found after studying hundreds of pictures of human faces with emotional expressions.
2. Emotions can be classified by means of a dimensional model. Dimensional models have a long history in psychology (Schlosberg, 1952; Wundt, 1896) and are especially popular in psychophysiological research. Wundt (1896) was one of the first to classify “simple feelings” into a three-dimensional model, which consisted of the three fundamental axes of pleasure-displeasure (Lust-Unlust), arousal-composure (Erregung-Beruhigung), and tension-resolution (Spannung-Lösung). A more modern approach and currently the most popular dimensional model was suggested by Russell (1980). His circumplex model (see Figure 1) assumes the possible classification of emotional responses in a circular order on a plane spanned by two axes, emotional valence and arousal. The mapping of emotions to the two dimensions of valence and arousal has been used in numerous studies (Lang, 1995; Posner, Russell, & Peterson, 2005; Watson & Tellegen, 1985; Watson, Wiese, Vaidya, & Tellegen, 1999) including studies of digital games (Mandryk & Atkins, 2007; Nacke & Lindley, 2008; Ravaja, et al., 2008).
Figure 1. The two-dimensional circumplex emotional model based on Russell (1980)
The current popularity of dimensional models of emotion in psychophysiology can be explained by the fact that Wundt (1896) was one of the first researchers to correlate physiological signals, such as respiration, blood-pressure, and pupil dilation, with his “simple feelings” dimensions. Bradley and Lang (2007) note that discrete and dimensional models of emotion need not be mutually exclusive but, rather, these views of emotion could be seen as complementary to each other. For example, basic emotions can be classified within affective dimensions. Finding physiological and behavioural emotion patterns as responses to specific situations and stimuli is one of the major challenges that psychophysiological emotion research faces currently. However, new evidence from neurophysiological functional Magnetic Resonance Imaging (fMRI) studies supports the affective circumplex model of emotion (Posner et al., 2009), showing neural networks in the brain that can be connected to the affective dimensions of valence and arousal: in this case, affective pictures were used as stimuli. The measurement of emotions induced by sound stimuli in a game context is, however, more complex. To identify how a certain sound, or a game element in general, is perceived, a subjective investigation is necessary, usually done after the experimental session. Gathering subjective responses in addition to psychophysiological measurements of player affect allows cross-correlation and validation of certain emotional stimuli that may be present in a gaming situation. This ‘after-the-fact’ narration is not, however, without its self-evident problems. A further major challenge remains the distinction between auditory and visual stimuli within games, as many games evoke highly immersive, audio-visual experiences, which can also be influenced by setting, past experiences, and social context.
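To make the valence-arousal plane of the circumplex model more concrete, a measured response can be treated as a point on that plane and described by its angle and its distance from the neutral origin. The following is only a minimal sketch under stated assumptions: the normalisation of both axes to [-1, 1], the function name circumplex_position, and the four quadrant labels are illustrative choices that do not come from Russell (1980) or from the studies cited above.

```python
import math
from typing import Dict, Tuple

# Hypothetical quadrant labels; the wording is illustrative only.
QUADRANTS: Dict[Tuple[bool, bool], str] = {
    (True, True): "pleasant, high arousal",
    (True, False): "pleasant, low arousal",
    (False, True): "unpleasant, high arousal",
    (False, False): "unpleasant, low arousal",
}


def circumplex_position(valence: float, arousal: float) -> Tuple[float, float, str]:
    """Return (angle in degrees, intensity, quadrant label) for one affect sample.

    Assumption: both inputs are normalised to [-1, 1], with 0 as neutral.
    """
    if not (-1.0 <= valence <= 1.0 and -1.0 <= arousal <= 1.0):
        raise ValueError("valence and arousal must lie in [-1, 1]")
    # Angle measured counter-clockwise from the positive valence axis.
    angle = math.degrees(math.atan2(arousal, valence)) % 360.0
    # Distance from the neutral origin of the plane.
    intensity = math.hypot(valence, arousal)
    label = QUADRANTS[(valence >= 0.0, arousal >= 0.0)]
    return angle, intensity, label
```

A sample with negative valence and positive arousal, such as a response to a startling sound, would fall into the "unpleasant, high arousal" quadrant in this sketch. Where exactly a named emotion sits on the circle is an empirical question that the sketch does not attempt to answer.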
Thus, we suggest that for measurement of emotional responses to game sound, three broad strategies are available for a full, scientific comprehension of player experiences. This means that there are at least three ways of understanding the emotional player experience in games (each illustrated by a particular stratagem) but the third, being a combination of the previous two, is likely to be the most accurate:
1. As objective, context-dependent experience – Physiological measures (using sensor technology) of how a player’s body reacts to game stimuli can inform our understanding of these emotions.
2. As subjective, interpreted experience – Psychological measures of how players understand and interpret their own emotions can inform our understanding of these emotions.
3. As subjective-objective, interpreted and contextual experience – Inferences drawn from physiological reactions and psychological measures allow a more holistic understanding of experience.
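The third, combined strategy can be illustrated by correlating a subjective rating with a physiological measure across players. The sketch below assumes one value of each kind per player. The variable names, the choice of skin conductance as the physiological signal, and all numbers are hypothetical and serve only to show the computation of a Pearson correlation coefficient; a real study would also need to consider sample size and significance testing.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    covariance = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical data, one value per player (not taken from any cited study):
# a post-session arousal rating and a mean skin conductance level.
self_reported_arousal = [2.0, 3.5, 4.0, 1.5, 5.0]
skin_conductance = [0.8, 1.4, 1.5, 0.6, 2.1]

r = pearson_r(self_reported_arousal, skin_conductance)
```

A coefficient close to +1 or -1 would indicate that the subjective and the physiological measure vary together across players, whereas a value near 0 would suggest that the two measures capture different aspects of the experience.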
One of our primary research goals is to understand gaming experience, which has been connected to positive emotions (Clark, Lawrence, Astley-Jones, & Gray, 2009; Fernandez, 2008; Frohlich & Murphy, 1999; Hazlett, 2006; Mandryk & Atkins, 2007), but also to more complex experiential constructs like, for example, immersion (Calleja, 2007; Ermi & Mäyrä, 2005; Jennett, et al., 2008), flow (Cowley, Charles, Black, & Hickey, 2008; Csíkszentmihályi, 1990; Gackenbach, 2008; Sweetser & Wyeth, 2005) or presence (Lombard & Ditton, 1997; Slater, 2002; Zahorik & Jenison, 1998). Thus, we will provide an overview of the current understanding of immersion, flow and presence in games and then provide suggestions as to how this could be measured using objective and subjective approaches.
IMMERSION, FLOW, AND PRESENCE
In the fields of game science, media psychology, communication and computer science, many studies are concerned with uncovering experiences evoked by playing digital games. There is a lot of work directed towards investigating the potentials, definition, and limitations of immersion in digital games (Douglas & Hargadon, 2000; Ermi & Mäyrä, 2005; Jennett et al., 2008; Murray, 1995). A major challenge of studying immersion is defining what exactly is meant by the term “immersion” and how it relates to similar game experience phenomena such as flow (Csíkszentmihályi, 1990), cognitive absorption (Agarwal & Karahanna, 2000) and presence (Lombard & Ditton, 1997; Slater, 2002).
From Immersion to Flow and Presence
In a very comprehensive effort, Jennett et al. (2008; see also Slater, 2002) give an extensive conceptual overview of immersion. According to their definition, immersion is a gradual, time-based, progressive experience that includes the suppression of all surroundings (spatial, audio-visual, and temporal perception), together with attention and involvement mediating the feeling of being in a virtual world. This suggests immersion to be an experience related to cognitive processing and attention: the more immersive an experience is, the more attentionally demanding it is (see Reiter, 2011 for a discussion of attention and audio stimuli). One could hypothesize that emotional state drives attention (Öhman, Flykt, & Esteves, 2001) and therefore, the more affective an experience is, the more likely it is to grab individual attention and consequently to immerse the player. Thus, immersion would be elicited as the result of an action chain that starts with affect. This prompts an emotional response that influences attention and, as a consequence, leads to immersion. It remains to be shown whether, and how, affective
responses of players influence immersion and what measures of player affect are most suitable to evaluate immersion. Immersion is seen in some literature (Sweetser & Wyeth, 2005) – based on qualitative analysis – as an enabler of a fleeting experience of peak performance labeled flow (Csíkszentmihályi, 1990; Nakamura & Csíkszentmihályi, 2002). Flow is a little understood, but often-used experiential concept for describing one kind of game experience. Some examples from game studies and human-computer-interaction literature try to use flow for analyzing successful game design features of games (Cowley et al., 2008; Sweetser & Wyeth, 2005). However, originally, flow was conceived by Csíkszentmihályi (1975) on the basis of studies of intrinsically motivated behaviour of artists, chess players, musicians, and sports players. This group was found to be rewarded by executing actions per se, experiencing high enjoyment and fulfilment in the activity itself rather than, for example, being motivated by future achievement. Csíkszentmihályi describes flow as a peak experience, the “holistic sensation that people feel when they act with total involvement” (p. 36). Thus, complete mental absorption in an activity is fundamental to this concept, which ultimately makes flow an experience mainly found in situations with high cognitive loading accompanied by a feeling of pleasure. According to a more recent description from Nakamura and Csíkszentmihályi (2002), it should be noted that for entering flow, two conditions should be met: (1) a matching of challenges or action opportunities to an individual’s skill and (2) clear and close goals with immediate feedback about progress. 
Flow itself can be described through the following manifested qualities (which are admittedly too fuzzy for a clear evaluation using subjective or objective methods): (1) concentration focuses on present moment, (2) action and consciousness merge, (3) self-awareness is lost, (4) one is in full control over one’s actions, (5) temporal perception is distorted, and (6) doing the activity is rewarding
in itself (Nakamura & Csíkszentmihályi, 2002). Flow even shares some properties with immersion, such as a distorted temporal perception and lost or blurred awareness of self and surroundings. Jennett et al. (2008) argue that immersion can be seen as a precursor for flow experiences, thus allowing immersion and flow to overlap in certain game genres, while noting that immersion can also be experienced without flow: Immersion, in their definition, is the “prosaic experience of engaging with a videogame” (p. 643) rather than an attitude towards playing or a state of mind. One important question in the discussion about flow and immersion is whether flow is a state or a process. Defining flow as a static rather than a procedural experience would be in contrast to the process-based definitions of immersion such as the challenge-based immersion of Ermi and Mäyrä (2005). This kind of immersion oscillates around the success and failure of certain types of game interactions. Another important differentiation between flow and immersion is that immersion could be described as a “growing” feeling, an experience that unfolds over time and is dependent on perceptual readiness of players as well as the audio-visual sensory output capabilities of the gaming system. Past theoretical and taxonomical approaches have tried to define immersion as consisting of several phases or components. For example, Brown and Cairns (2004) describe three gradual phases of immersion: engagement, engrossment, and total immersion, where the definition of total immersion as an experience of total disconnection with the outside world overlaps with definitions of telepresence, where users feel mentally transported into a virtual world (Lombard & Ditton, 1997). The concept of presence is also discussed by Jennett et al. (2008) in relation to immersion, but defined as a state of mind rather than a gradually progressive experience like immersion. 
If we assume for a moment that immersion is an “umbrella” experience, immersion could incorporate notions of presence and flow at certain stages of its progress. It remains, however, unclear
through what phases immersion unfolds and what types of stimuli are likely to foster immersive experiences. In what situations is immersion likely to unfold and what situational elements make it progress? When does it reach its peak and how much immersion is too much? More research is needed to investigate such questions, as well as a possible link between high engagement and addiction, as studied by Seah and Cairns (2008) or the differences between high engagement and addiction as suggested in a study by Charlton and Danforth (2004).
The SCI Immersion Model
Ermi and Mäyrä (2005) subdivide immersive game experiences into sensory (as mentioned above), challenge-based and imaginative immersion (the SCI-model) based on qualitative surveys. The elements of this immersion model account for different facilitators of immersion, that is, the elements (in a gaming context) through whose experience immersion is likely to take place. The three immersive game experiences Ermi and Mäyrä give implicitly provide different immersion models of static state and progressive experience. Sensory immersion can be enhanced by amplifying a game’s audio-visual components, for example, using a larger screen, a surround-sound speaker system, or greater audio volume. If immersion is actually facilitated in this way, immersion would be an affective experience, as evidence points to the fact that enhanced audio-visual presentation results in an enhanced affective gaming experience (Ivory & Kalyanaraman, 2007). By jamming the perceptive systems of players (as a result of mental workload associated with auditory and visual processing of game stimuli), sensory immersion probably also helps to guide players’ attention (see Reiter, 2011). This strengthens the hypothetical link between attentional processing and immersive feeling found in related literature (Douglas & Hargadon, 2000) but, while the link remains, the cognitive direction is the reverse of
those discussed earlier. Imaginative immersion describes absorption in the narrative of a game or identification with a character, which is understood to be synonymous with feelings of empathy and atmosphere. However, atmosphere might be an agglomeration of imaginative immersion and sensory immersion (since certain sounds and graphics might facilitate a compelling atmospheric player experience): the use of this term raises the need for a clearer definition of the concept of atmosphere and this is not provided by Ermi and Mäyrä (2005). If ‘imaginative’ refers mainly to cognitive processes of association, creativity, and memory recall, it is likely to be facilitated by player affect. However, individual differences are huge when it comes to pleasant imagination (this is probably a matter of personal preference), which would make it very difficult to accurately assess this kind of immersion using empirical methodology. The last SCI dimension, viz. challenge-based immersion, conforms closely with one feature of Csíkszentmihályi’s (1990) description of flow. This is the only type of immersion in this model that suggests it might be a progressive experience, because challenge level is never simply static but is something that oscillates around the success and failure of certain types of interaction over time. If we assume now that immersion is linked to either successful or failed interactions in a game that are likely to strengthen or weaken the subjective feeling of immersion, we can try to establish the following relationship between game interactions and immersion. Given a number of successful interactions σ, a number of failed interactions φ, and incremental playing time τ, then two descriptions of the magnitude of immersion ι could be considered:
(1) For σ, φ > 0: If σ > φ, then ι = (σ/φ) × τ.
(2) For σ, φ > 0: If σ ≤ φ, then ι = σ/(φ × τ).
These equations would suggest that the longer people play with a higher success than failure rate, the more immersed they would feel. If the failure rate is higher than the success rate, the feeling of immersion for players will decrease over time. Many sonic interactions in games are implicitly challenge-based because they require interpretation (or are understood from previous experience), but an example of explicitly challenge-based sonic interaction in games is given by Grimshaw (2008a) in his description of the navigational mode of listening (p. 32). It remains to be tested whether such an equation could account for immersion itself or whether this would only measure one aspect of the immersive experience. Ideally, such a ratio would be extended and combined with psychophysiological variables that measure a player’s affective response over time.
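The two-case relationship above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a validated measure: the function name and the requirement that τ be positive are assumptions added here so that the division is well defined.

```python
def immersion_magnitude(successes: float, failures: float, play_time: float) -> float:
    """Return the magnitude of immersion (iota) for the two cases above.

    successes = sigma, failures = phi, play_time = tau.
    Requiring tau > 0 is an assumption of this sketch.
    """
    if successes <= 0 or failures <= 0 or play_time <= 0:
        raise ValueError("sigma, phi and tau must all be greater than zero")
    ratio = successes / failures
    if successes > failures:
        # Case (1): success dominates, so the value grows with playing time.
        return ratio * play_time
    # Case (2): failure dominates or ties, so the value shrinks with playing time.
    return ratio / play_time


# Twice as many successes as failures over 10 time units, then the reverse.
print(immersion_magnitude(8, 4, 10.0))  # 20.0
print(immersion_magnitude(4, 8, 10.0))  # 0.05
```

With the same 2:1 imbalance, the value rises to 20.0 when success dominates and falls to 0.05 when failure dominates, which is the asymmetry over time that the text describes.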
Implications for Player-Game Interaction and Affective Sound
In the context of sound and immersion in computer games, other work investigates the role of sound in facilitating player immersion in the gameworld. A strong link between “visual, kinaesthetic, and auditory modalities” is hypothetically assumed to be key to immersion (Laurel, 1991, p. 161). The degree of realism provided by sound cues is also a primary facilitator for immersion, with realistic audio samples being drivers of immersion (Jørgensen, 2006), similar to employing spatial sound (Murphy & Pitt, 2001), although some authors, as noted by Grimshaw (2008b), argue for an effect of immersion through perceptual realism of sound (as opposed to a mimetic realism) where verisimilitude, based on codes of realism, proves as effective as, if not more effective than, emulation and authenticity of sound (see Farnell, 2011). Self-produced, autopoietic sounds of players, and the immersive impact that sounds have on the relationship between players and the virtual environment a game is played in, have been framed in discussions on acoustic ecologies in first-person shooter (FPS) games, which provide a range of conceptual tools for analyzing immersive functions of game sound (Grimshaw, 2008a; Grimshaw & Schott, 2008). In an argument for physical immersion of players through spatial qualities of game sound (Grimshaw, 2007), we find the concept of sensory immersion recurring (Ermi & Mäyrä, 2005). The perception of game sound in this context is not only loading the player’s mental and attentional capacities but is also having an effect on the player’s unconscious emotional state. The phenomenon of physical sonic immersion is not new, but has been observed before for movie theatre audiences, and the concept has been transferred to sound design in FPS simulations and games (Shilling, Zyda, & Wardynski, 2002). In some cases, the sensory intensity levels of game sound may be such that affect really is a gut feeling, as alluded to earlier in this chapter. Possible immersion through computer game sound may be strong enough to enable a similar affective experience by playing with audio only, as investigations in this direction suggest (Röber & Masuch, 2005).
PSYCHOPHYSIOLOGICAL MEASUREMENT OF EMOTIONS
As we have discussed before, a rather modern approach is the two-dimensional model of emotional affect and arousal suggested by Russell (1980, see Figure 1). Ekman’s (1992) insight that basic emotions are reflected in facial expressions was fundamental for subsequent studies investigating physiological responses of facial muscles using a method called electromyography (EMG), which measures subtle reactions of muscles in the human body (Cacioppo, Berntson, Larsen, Poehlmann, & Ito, 2004). For example, corrugator muscle activity (in charge of lowering the eyebrow) was found to increase when a person is in a bad mood (Larsen, Norris, & Cacioppo, 2003). In contrast to this, zygomaticus muscle activity (on the cheek) increases during positive moods. High orbicularis oculi muscle activity (responsible for closing the eyelid) is associated with mildly positive emotions (Cacioppo, Tassinary, & Berntson, 2007). An advantage of physiological assessment is that it can assess covert activity of facial muscles with great sensitivity to subtle reactions (Ravaja, 2004). Measuring emotions in the circumplex model of emotional valence and arousal is now possible during interactive events, such as playing games, by covertly recording the physiological activity of brow, cheek, and eyelid muscles (Mandryk, 2008; Nacke & Lindley, 2008; Ravaja, et al., 2008). For the correct assessment of arousal, additional measurement of a person’s electrodermal activity (EDA) is necessary (Lykken & Venables, 1971), which is measured either from palmar sites (thenar/hypothenar eminences of the hand) or plantar sites (e.g., above abductor hallucis muscle and midway between the proximal phalanx of the big toe) (Boucsein, 1992). The conductance of the skin is directly related to the production of sweat in the eccrine sweat glands, which is entirely controlled by a human’s sympathetic nervous system. Increased sweat gland activity is reflected in higher electrical skin conductance. Using EMG measurements of facial muscles that reliably measure basic emotions and EDA measurements that indicate a person’s arousal, we can correlate emotional states of users to specific game events or even complete game sessions (Nacke, Lindley, & Stellmach, 2008; Ravaja, et al., 2008). Below, we refer to several experiments analyzing cumulative measurements of EMG and EDA to assess the overall affective experience of players in diverse game sound scenarios.
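One simple way to relate a physiological recording to logged game events is to compare the signal shortly after each event with a short baseline just before it. The sketch below illustrates that idea for an evenly sampled signal. The function name, the window lengths, and the sample values are assumptions made for illustration, not the procedure used in the cited studies.

```python
from statistics import mean


def event_related_change(signal, sample_rate_hz, event_times_s,
                         baseline_s=1.0, response_s=2.0):
    """Mean post-event level minus mean pre-event baseline, one value per event.

    signal: evenly sampled values (e.g. EMG amplitude or skin conductance)
    sample_rate_hz: samples per second
    event_times_s: event timestamps in seconds from the start of the recording
    Events whose windows extend beyond the recording are skipped.
    """
    base_n = int(baseline_s * sample_rate_hz)
    resp_n = int(response_s * sample_rate_hz)
    if base_n < 1 or resp_n < 1:
        raise ValueError("windows must each cover at least one sample")
    changes = []
    for t in event_times_s:
        idx = int(round(t * sample_rate_hz))
        start, end = idx - base_n, idx + resp_n
        if start < 0 or end > len(signal):
            continue  # not enough data around this event
        baseline = mean(signal[start:idx])
        response = mean(signal[idx:end])
        changes.append(response - baseline)
    return changes


# Hypothetical 2 Hz recording that steps from 1.0 to 3.0 at t = 2 s.
recording = [1.0] * 4 + [3.0] * 4
print(event_related_change(recording, 2, [2.0]))  # [2.0]
```

Averaging these per-event differences within a condition would give one cumulative value per player and condition, which is the kind of summary the experiments below compare.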
Pointers from Psychophysiological Experiments
A set of preliminary experiments (Grimshaw et al., 2008; Nacke, 2009; Nacke, Grimshaw, & Lindley, 2010) investigated the impact of the sonic user experience and psychophysiological effects of
game sound (i.e., diegetic sound FX) and music in an FPS game. They measured EMG and EDA responses together with subjective questionnaire responses for 36 undergraduate students with a 2 × 2 repeated-measures factorial design using sound (on and off) and music (on and off) as predictor variables with a counter-balanced order of sound and music presentation in an FPS game level. Among many results, two are particularly interesting: (1) higher co-active EMG brow and eyelid activity when music was present than when it was absent (regardless of other sounds) and (2) a strong effect of sound on gameplay experience dimensions (IJsselsteijn, Poels, & de Kort, 2008). In the case of the latter result, higher subjective ratings of immersion, flow, positive affect, and challenge, together with lower negative affect and tension ratings, were found when sound was present than when it was absent (regardless of music). The psychophysiological results of this study put the usefulness of (tonic) psychophysiological measures to the test, since the literature points to expressions of antipathy when the facial muscles under investigation are activated at the same time (Bradley, Codispoti, Cuthbert, & Lang, 2001). The caveat here is that the most common stimuli that have been used in psychophysiological research are pictures (Lang, Greenwald, Bradley, & Hamm, 1993). Using music, especially in a highly immersive environment such as a first-person perspective digital game, may lead to a number of emotions being elicited simultaneously, which might lie outside of the dimensional space that is being used in Russell’s (1980) model. That model holds that a person’s emotional experience is a cognitive interpretation of this automatic physiological response (Russell, 2003).
But the bipolarity of the valence-arousal dimensions has been criticized before, as the model is too rigid to allow for simultaneous (i.e., positive and negative) emotion measurements (Tellegen, Watson, & Clark, 1999). Using sound and music in a digital game is, however, a very ambiguous and complex use of stimuli and prior research has
suggested that the emotional responses to such complex stimuli can be simultaneously positive and negative (Larsen, McGraw, & Cacioppo, 2001; Larsen, McGraw, Mellers, & Cacioppo, 2004). Tellegen et al. (1999) proposed a structural hierarchical model of emotion which might be more suited in this context by providing for both independent positive emotional activation (PA) and negative emotional activation (NA) organized in a three-level hierarchy. The top level is formed by a general bipolar Happiness-Unhappiness dimension, followed by the PA and NA dimensions, with discrete emotions forming its base. With this model, we could argue that the findings of Nacke et al. (2010) show an independent positive and negative emotional activation during the music conditions. This would, however, also indicate that the physiological activity is not a direct result of the sound and music conditions, but arguably of a combination of stimuli present during these conditions. In addition, greater electrodermal activity was found for female players when both sound and music were off, while the responses for male players were almost identical (see Figure 2). The authors assumed music to have a calming effect on female players, resulting in less arousal during gameplay. For females, music was also connected with pleasant emotions, as higher eyelid EMG activity indicated. Overall, the psychophysiological results from that study pointed toward a positive emotional effect of the presence of both sound and music (see also Nacke, 2009). Interesting in this context is that music does not seem to be experienced significantly differently on a subjective level, whereas sound was clearly indicated as having an influence on game experience. Higher subjective ratings of immersion, flow, positive affect, and challenge, together with lower negative affect and tension ratings when sound was present, paint a positive picture of sound for a good game experience (particularly so when music is absent).
The results discussed above are ones that run the gamut from expected (sound contributes
Figure 2. Results of electrodermal activity (EDA averages in log [µS]) from the Nacke et al. (2010) study, split up between gender, sound, and music conditions in the experiment (see also Nacke, 2009)
positively to the experience of playing games) to the interesting and meriting further investigation (for example, gender differences in sound affect in the context of FPS games). Being the results of preliminary experiments, they typically provoke more questions than they answer and such results should, for the time being, be viewed in the light of several limiting factors. For example, the experiments provided audio-visual stimuli (not solely audio) and the sub-genre of game used – the FPS game – proposes a hunter-and-the-hunted scenario which, perhaps, might account for the gender affect differences noted. Another limitation that needs to be considered in psychophysiological research is the effect of familiarity with a particular game genre and a psychological mindset. Thus, a personality test and demographic questions regarding playing habits and behaviour
will help circumvent possible priming effects of familiarity or non-familiarity with games in the experimental analysis. In our experiments, personality assessments and demographic questionnaires were handed out prior to each study to factor out priming elements later in the statistical analysis. Finally, it is difficult to correlate objective measurements taken during gameplay with subjective, post-experiential responses and it may well be that such psychophysiological measurements are not the best method for assessing the role of sound in digital games.
CONCLUSION
In this chapter, we have given an overview of the emotional components of gameplay experience
with a special focus on the influence of sound and music. We have discussed the results of experiments that have made use of both subjective and objective assessments of game sound and music. After these pilot studies, and our discussion of emotional theories and experiential constructs, we have to conclude that the detailed exploration of game sound and music at this stage of our knowledge is still difficult to conduct because there are few comparable research results available and there is not yet a perfect measurement methodology. The multi-method combination of subjective and objective quantitative measures is a good starting point from which to create and refine more specific methodologies for examining the impact of sound and music in games.
Important Questions and Future Challenges
The important questions regarding game design that aims to facilitate flow, fun, or immersive experiences are: should tasks be provided by the game (i.e., created by the designer), should they be encouraged by the game environment, or should finding the task be part of the gameplay? The latter is rather unlikely, since finding only one task at a time sequentially might frustrate players and choosing a pleasant task according to individual mood, emotional, or cognitive disposition will probably provide more fun. Thus, instead of saying players need to face tasks that can be completed, it might be better design advice to provide several game tasks at the same time and design for an environment that encourages playful interaction. An environment that facilitates flow, fun, or immersion would provide opportunities for the player to alternate between playing for its own sake (i.e., setting up their own tasks) and finding closure by completing a given task. Some of the future challenges here will include finding good experimental designs that clearly distinguish audio stimuli, while still being embedded in a gaming context, in order that the
measurements and results obtained remain valid and thus more readily informative for the design suggestions above. We also see a lot of potential in cross-correlation of subjective and objective measures in terms of attentional activation, such as the exploration of brain wave (that is, EEG) data to find out more about the cognitive underpinnings of gameplay experience, by this means potentially separating experiential constructs from an affective emotional attribution and aligning them to an attentive cognitive attribution. Experiments might be designed to answer the question: does attention guide immersion, or vice versa? Others might investigate sound and affect in game genres other than FPS.
Potential of These New Technologies for Sound Design
Why go to all this experimental trouble? After all, most digital games seem to function well enough with current sound design paradigms. The answer lies in two technologies, both having great potential for the future of sound design. The first is procedural audio; as other chapters here deal with the subject in great depth (Farnell, 2011; Mullan, 2011), we limit ourselves to highlighting the importance of the ability to stipulate affective-emotional parameters for the real-time synthesis of sound. It is generally accepted that a sudden, loud sound in a particular context (perhaps there is a preceding silence and darkness wrapped up in a horror genre context) is especially arousing. However, what is less understood is the role of, for example, timbre in affect and emotion and, in the context of digital games and virtual environments, immersion. Would it be effective to design an affective real-time sound synthesis sub-engine as part of the game engine where the controllable parameters are not amplitude and frequency but high-level factors such as fear, happiness, arousal, or relaxation? Perhaps these parameters could be governed by the player in the game set-up menu who might opt, for instance, for a more or less
emotionally intense experience through the use of a simple fader. This brings us to the second technology. Although rudimentary and imprecise, consumer biofeedback equipment for digital devices (including computers and gaming consoles) is beginning to appear.1 Pass the output of these devices (which are variations of the EMG and ECG/EKG technologies used in the experiments previously described) to the controllable parameters of the game sound engine proposed above and procedural audio becomes a highly responsive, affective, and emotive technology.2 Furthermore, a feedback loop is established in which both play and sound emotionally respond to each other. In effect, the game itself takes on an emotional character that reacts to the player’s affect state and emotions and that elicits affect responses and emotions in turn – perhaps the game’s character might be empathetic or antagonistic to the player. This is the future of game sound design and the reason for pursuing the line of enquiry described in this chapter.
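As a purely hypothetical sketch of the idea above, the following Python fragment combines a normalized arousal estimate (such as might be derived from a biofeedback device) with a player-set intensity fader and maps the result onto two low-level synthesis parameters. Every name, range, and mapping here is an assumption chosen for illustration; no real game engine or device API is implied.

```python
from dataclasses import dataclass


@dataclass
class SynthParams:
    gain: float       # linear amplitude, 0.25 to 1.0 in this sketch
    cutoff_hz: float  # low-pass filter cutoff, 500 to 8000 Hz in this sketch


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def affective_to_synth(arousal: float, intensity_fader: float) -> SynthParams:
    """Map high-level affective controls to low-level synthesis parameters.

    arousal: estimated player arousal, 0.0 (calm) to 1.0 (highly aroused)
    intensity_fader: player preference from the set-up menu, 0.0 to 1.0
    The linear mapping below is an arbitrary illustrative choice.
    """
    arousal = clamp(arousal, 0.0, 1.0)
    fader = clamp(intensity_fader, 0.0, 1.0)
    drive = arousal * fader  # the fader scales how strongly arousal is used
    return SynthParams(gain=0.25 + 0.75 * drive,
                       cutoff_hz=500.0 + 7500.0 * drive)


# A highly aroused player who has set the fader to half intensity.
print(affective_to_synth(1.0, 0.5))  # SynthParams(gain=0.625, cutoff_hz=4250.0)
```

In this sketch the game responds in sympathy with the player (more arousal gives louder, brighter sound). An antagonistic game character could be approximated by inverting the arousal term before scaling; which choice is more effective is exactly the kind of question the experiments discussed in this chapter would need to answer.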
REFERENCES
Agarwal, R., & Karahanna, E. (2000). Time flies when you’re having fun: Cognitive absorption and beliefs about information technology usage. Management Information Systems Quarterly, 24(4), 665–694. doi:10.2307/3250951
Audiosurf. [Video game]. (2008). Dylan Fitterer (Developer), Bellevue, WA: Valve Corporation (Steam).
Bateman, C. (2009). Beyond game design: Nine steps towards creating better videogames. Boston: Charles River Media.
Bateman, C., & Boon, R. (2006). 21st century game design. Boston: Charles River Media.
Bentley, T., Johnston, L., & von Baggo, K. (2005). Evaluation using cued-recall debrief to elicit information about a user’s affective experiences. In T. Bentley, L. Johnston, & K. von Baggo (Eds.), Proceedings of the 17th Australian conference on Computer-Human Interaction (pp. 1-10). New York: ACM.
Biedermann, I., & Vessel, E. A. (2006). Perceptual pleasure and the brain. American Scientist, 94(May-June), 247–253.
Boucsein, W. (1992). Electrodermal activity. New York: Plenum Press.
Bradley, M. M., Codispoti, M., Cuthbert, B. N., & Lang, P. J. (2001). Emotion and motivation I: Defensive and appetitive reactions in picture processing. Emotion (Washington, D.C.), 1(3), 276–298. doi:10.1037/1528-3542.1.3.276
Bradley, M. M., & Lang, P. J. (2000). Affective reactions to acoustic stimuli. Psychophysiology, 37, 204–215. doi:10.1017/S0048577200990012
Bradley, M. M., & Lang, P. J. (2007). Emotion and motivation. In Cacioppo, J. T., Tassinary, L. G., & Berntson, G. G. (Eds.), Handbook of psychophysiology (3rd ed., pp. 581–607). New York: Cambridge University Press. doi:10.1017/CBO9780511546396.025
Brown, E., & Cairns, P. (2004). A grounded investigation of game immersion. In Dykstra-Erickson, E., & Tscheligi, M. (Eds.), CHI ‘04 extended abstracts (pp. 1297–1300). New York: ACM.
Bushman, B. J., & Anderson, C. A. (2002). Violent video games and hostile expectations: A test of the General Aggression Model. Personality and Social Psychology Bulletin, 28(12), 1679–1686. doi:10.1177/014616702237649
Cacioppo, J. T., Berntson, G. G., Larsen, J. T., Poehlmann, K. M., & Ito, T. A. (2004). The psychophysiology of emotion. In Lewis, M., & Haviland-Jones, J. M. (Eds.), Handbook of emotions (2nd ed., pp. 173–191). New York: Guilford Press.
Cacioppo, J. T., Tassinary, L. G., & Berntson, G. G. (2007). Handbook of psychophysiology (3rd ed.). Cambridge, UK: Cambridge University Press. doi:10.1017/CBO9780511546396
Caillois, R. (2001). Man, play and games. Chicago: University of Illinois Press.
Calleja, G. (2007). Digital games as designed experience: Reframing the concept of immersion. Unpublished doctoral dissertation. Victoria University of Wellington, New Zealand.
Cannon, W. B. (1927). The James-Lange theory of emotions: A critical examination and an alternative theory. The American Journal of Psychology, 39(1/4), 106–124. doi:10.2307/1415404
Carnagey, N. L., Anderson, C. A., & Bushman, B. J. (2007). The effect of video game violence on physiological desensitization to real-life violence. Journal of Experimental Social Psychology, 43(3), 489–496. doi:10.1016/j.jesp.2006.05.003
Charlton, J. P., & Danforth, I. D. W. (2004). Differentiating computer-related addictions and high engagement. In Morgan, K., Brebbia, C. A., Sanchez, J., & Voiskounsky, A. (Eds.), Human perspectives in the internet society: culture, psychology and gender. Southampton: WIT Press.
Clark, L., Lawrence, A. J., Astley-Jones, F., & Gray, N. (2009). Gambling near-misses enhance motivation to gamble and recruit win-related brain circuitry. Neuron, 61(3), 481–490. doi:10.1016/j.neuron.2008.12.031
Collins, K., Tessler, H., Harrigan, K., Dixon, M. J., & Fugelsang, J. (2011). Sound in electronic gambling machines: A review of the literature and its relevance to game audio. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Cowley, B., Charles, D., Black, M., & Hickey, R. (2008). Toward an understanding of flow in video games. Computers in Entertainment, 6(2), 1–27. doi:10.1145/1371216.1371223
Csíkszentmihályi, M. (1975). Beyond boredom and anxiety. San Francisco: Jossey-Bass Publishers.
Csíkszentmihályi, M. (1990). Flow: The psychology of optimal experience. New York: Harper Perennial.
Cunningham, S., Grout, V., & Picking, R. (2011). Emotion, content and context in sound and music. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Damasio, A. R. (1994). Descartes’ error. New York: G. P. Putnam.
Darwin, C. (1899). The expression of the emotions in man and animals. New York: D. Appleton and Company.
Dekker, A., & Champion, E. (2007). Please biofeed the zombies: Enhancing the gameplay and display of a horror game using biofeedback. In Proceedings of DiGRA: Situated Play Conference. Retrieved January 1, 2010, from http://www.digra.org/dl/db/07312.18055.pdf.
Dix, A., Finlay, J., & Abowd, G. D. (2004). Human-computer interaction. Harlow, UK: Pearson Education.
DJ hero. [Video game]. (2009). FreeStyleGames (Developer), Santa Monica, CA: Activision.
Donkey Konga. [Video game]. (2004). Namco (Developer), Kyoto: Nintendo.
Dorval, M., & Pepin, M. (1986). Effect of playing a video game on a measure of spatial visualization. Perceptual and Motor Skills, 62, 159–162.
Douglas, Y., & Hargadon, A. (2000). The pleasure principle: Immersion, engagement, flow. In Proceedings of the eleventh ACM on Hypertext and Hypermedia (pp. 153-160). New York: ACM.
Ekman, P. (1992). An argument for basic emotions. Cognition and Emotion, 6(3/4), 169–200. doi:10.1080/02699939208411068
Ekman, P., & Friesen, W. V. (1978). Facial action coding system: A technique for the measurement of facial movement. Palo Alto, CA: Consulting Psychologists Press.
Electroplankton. [Video game]. (2006). Indies Zero (Developer), Kyoto: Nintendo.
Elite beat agents. [Video game]. (2006). iNiS (Developer), Kyoto: Nintendo.
Ermi, L., & Mäyrä, F. (2005). Fundamental components of the gameplay experience: Analysing immersion. In Proceedings of DiGRA 2005 Conference Changing Views: Worlds in Play. Retrieved January 1, 2010, from http://www.digra.org/dl/db/06276.41516.pdf.
Farnell, A. (2011). Behaviour, structure and causality in procedural audio. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Ferguson, C. J. (2007). Evidence for publication bias in video game violence effects literature: A meta-analytic review. Aggression and Violent Behavior, 12(4), 470–482. doi:10.1016/j.avb.2007.01.001
Fernandez, A. (2008). Fun experience with digital games: A model proposition. In Leino, O., Wirman, H., & Fernandez, A. (Eds.), Extending experiences: Structure, analysis and design of computer game player experience (pp. 181–190). Rovaniemi, Finland: Lapland University Press.
Frequency. (2001). Sony Computer Entertainment (PlayStation 2).
Frohlich, D., & Murphy, R. (1999, December 20). Getting physical: what is fun computing in tangible form? Paper presented at the Computers and Fun 2 Workshop, York, UK.
Gackenbach, J. (2008). The relationship between perceptions of video game flow and structure. Loading... 1(3).
Gilleade, K. M., & Dix, A. (2004). Using frustration in the design of adaptive videogames. In Proceedings of ACE 2004 (pp. 228–232). New York: ACM.
Gilleade, K. M., Dix, A., & Allanson, J. (2005). Affective videogames and modes of affective gaming: Assist me, challenge me, emote me. In Proceedings of DiGRA 2005 Conference: Changing Views: Worlds in Play. Retrieved January 1, 2010, from http://www.digra.org/dl/db/06278.55257.pdf.
Gitaroo man. [Video game]. (2001). Koei/iNiS (Developer) (PlayStation 2).
Grimshaw, M. (2007). The resonating spaces of first-person shooter games. In Proceedings of The 5th International Conference on Game Design and Technology. Retrieved January 1, 2010, from http://digitalcommons.bolton.ac.uk/gcct_conferencepr/4/.
Grimshaw, M. (2008a). The acoustic ecology of the first-person shooter: The player, sound and immersion in the first-person shooter computer game. Saarbrücken: VDM Verlag Dr. Mueller.
Grimshaw, M. (2008b). Sound and immersion in the first-person shooter. International Journal of Intelligent Games & Simulation, 5(1), 2–8.
Grimshaw, M., Lindley, C. A., & Nacke, L. (2008). Sound and immersion in the first-person shooter: Mixed measurement of the player’s sonic experience. In Proceedings of Audio Mostly 2008 - A Conference on Interaction with Sound. Retrieved January 1, 2010, from http://digitalcommons.bolton.ac.uk/gcct_conferencepr/7/.
Grimshaw, M., & Schott, G. (2008). A conceptual framework for the analysis of first-person shooter audio and its potential use for game engines. International Journal of Computer Games Technology, 2008.
Guitar hero 5. [Video game]. (2009). RedOctane (Developer), Santa Monica, CA: Activision.
Guitar hero II. [Video game]. (2006). RedOctane (Developer), Santa Monica, CA: Activision.
Guitar hero III. [Video game]. (2007). RedOctane (Developer), Santa Monica, CA: Activision.
Guitar hero: On tour. [Video game]. (2008). RedOctane (Developer), Santa Monica, CA: Activision. (Nintendo DS).
Guitar hero. [Video game]. (2005). RedOctane (Developer), New York: MTV Games.
Guitar hero world tour. [Video game]. (2008). RedOctane (Developer), Santa Monica, CA: Activision.
Hazlett, R. L. (2006). Measuring emotional valence during interactive experiences: Boys at video game play. In Proceedings of CHI’06 (pp. 1023–1026). New York: ACM.
Hudlicka, E. (2008). Affective computing for game design. In Proceedings of the 4th International North American Conference on Intelligent Games and Simulation (GAMEON-NA). Montreal, Canada.
IJsselsteijn, W., Poels, K., & de Kort, Y. A. W. (2008). The Game Experience Questionnaire: Development of a self-report measure to assess player experiences of digital games. FUGA Deliverable D3.3. Eindhoven, The Netherlands: TU Eindhoven.
Ivory, J. D., & Kalyanaraman, S. (2007). The effects of technological advancement and violent content in video games on players’ feelings of presence, involvement, physiological arousal, and aggression. The Journal of Communication, 57(3), 532–555. doi:10.1111/j.1460-2466.2007.00356.x
James, W. (1884). What is an emotion? Mind, 9(34), 188–205. doi:10.1093/mind/os-IX.34.188
Jennett, C., Cox, A. L., Cairns, P., Dhoparee, S., Epps, A., & Tijs, T. (2008). Measuring and defining the experience of immersion in games. International Journal of Human-Computer Studies, 66, 641–661. doi:10.1016/j.ijhcs.2008.04.004
Jørgensen, K. (2006). On the functional aspects of computer game audio. In Audio Mostly: A Conference on Sound in Games.
Juul, J. (2005). Half-real: Video games between real rules and fictional worlds. Cambridge, MA: MIT Press.
Kuikkaniemi, K., & Kosunen, I. (2007). Progressive system architecture for building emotionally adaptive games. In BRAINPLAY ’07: Playing with Your Brain Workshop at ACE (Advances in Computer Entertainment) 2007.
Lang, P. J. (1995). The emotion probe. Studies of motivation and attention. The American Psychologist, 50, 372–385. doi:10.1037/0003-066X.50.5.372
Lang, P. J., Greenwald, M. K., Bradley, M. M., & Hamm, A. O. (1993). Looking at pictures: Affective, facial, visceral, and behavioral reactions. Psychophysiology, 30, 261–273. doi:10.1111/j.1469-8986.1993.tb03352.x
Lange, C. G. (1912). The mechanism of the emotions. In Rand, B. (Ed.), The classical psychologists (pp. 672–684). Boston: Houghton Mifflin.
Larsen, J. T., McGraw, A. P., & Cacioppo, J. T. (2001). Can people feel happy and sad at the same time? Journal of Personality and Social Psychology, 81(4), 684–696. doi:10.1037/0022-3514.81.4.684
Larsen, J. T., McGraw, A. P., Mellers, B. A., & Cacioppo, J. T. (2004). The agony of victory and thrill of defeat: Mixed emotional reactions to disappointing wins and relieving losses. Psychological Science, 15(5), 325–330. doi:10.1111/j.0956-7976.2004.00677.x
Larsen, J. T., Norris, C. J., & Cacioppo, J. T. (2003). Effects of positive and negative affect on electromyographic activity over zygomaticus major and corrugator supercilii. Psychophysiology, 40, 776–785. doi:10.1111/1469-8986.00078
Laurel, B. (1991). Computers as theatre. Boston, MA: Addison-Wesley.
LeDoux, J. (1998). The emotional brain. London: Orion Publishing Group.
Lego rock band. [Video game]. (2009). Harmonix (Developer), New York: MTV Games.
Lombard, M., & Ditton, T. (1997). At the heart of it all: The concept of presence. Journal of Computer-Mediated Communication, 3(2).
Lykken, D. T., & Venables, P. H. (1971). Direct measurement of skin conductance: A proposal for standardization. Psychophysiology, 8(5), 656–672. doi:10.1111/j.1469-8986.1971.tb00501.x
Mandryk, R. L. (2008). Physiological measures for game evaluation. In Isbister, K., & Schaffer, N. (Eds.), Game usability: Advice from the experts for advancing the player experience (pp. 207–235). Burlington, MA: Elsevier.
Mandryk, R. L., & Atkins, M. S. (2007). A fuzzy physiological approach for continuously modeling emotion during interaction with play environments. International Journal of Human-Computer Studies, 65(4), 329–347. doi:10.1016/j.ijhcs.2006.11.011
Menon, V., & Levitin, D. J. (2005). The rewards of music listening: Response and physiological connectivity of the mesolimbic system. NeuroImage, 28(1), 175–184. doi:10.1016/j.neuroimage.2005.05.053
Miller, D. J., & Robertson, D. P. (2009). Using a games console in the primary classroom: Effects of ‘Brain Training’ programme on computation and self-esteem. British Journal of Educational Technology, 41(2), 242–255. doi:10.1111/j.1467-8535.2008.00918.x
Moffat, D. (1980). Personality parameters and programs. In Trappl, R., & Petta, P. (Eds.), Creating personalities for synthetic actors (pp. 120–165). Berlin: Springer.
Mullan, E. (2011). Physical modelling for sound synthesis. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Murphy, D., & Pitt, I. (2001). Spatial sound enhancing virtual storytelling. In Proceedings of the International Conference ICVS, Virtual Storytelling Using Virtual Reality Technologies for Storytelling (pp. 20-29). Berlin: Springer.
Murray, J. H. (1995). Hamlet on the holodeck: The future of narrative in cyberspace. New York: Free Press.
Nacke, L., Lindley, C., & Stellmach, S. (2008). Log who’s playing: Psychophysiological game analysis made easy through event logging. In P. Markopoulos, B. Ruyter, W. IJsselsteijn, & D. Rowland (Eds.), Proceedings of Fun and Games, Second International Conference (pp. 150-157). Berlin: Springer. Nacke, L., & Lindley, C. A. (2008). Flow and immersion in first-person shooters: Measuring the player’s gameplay experience. In Proceedings of the 2008 Conference on Future Play: Research, Play, Share (pp. 81-88). New York: ACM.
Phase. [Video game], (2007). Harmonix Music Systems.
Picard, R. W. (1997). Affective computing. Cambridge, MA: MIT Press.
Pillay, H. K. (2002). An investigation of cognitive processes engaged in by recreational computer game players: Implications for skills of the future. Journal of Research on Technology in Education, 34(3), 336–350.
Plutchik, R. (2001). The nature of emotions. American Scientist, 89(4), 344–350.
Nacke, L. E. (2009). Affective ludology: Scientific measurement of user experience in interactive entertainment. Unpublished doctoral dissertation. Blekinge Institute of Technology, Karlskrona, Sweden. Retrieved January 1, 2010, from http://affectiveludology.acagamic.com.
Posner, J., Russell, J. A., Gerber, A., Gorman, D., Colibazzi, T., & Yu, S. (2009). The neurophysiological bases of emotion: An fMRI study of the affective circumplex using emotion-denoting words. Human Brain Mapping, 30(3), 883–895. doi:10.1002/hbm.20553
Nacke, L. E., Grimshaw, M. N., & Lindley, C. A. (2010). More than a feeling: Measurement of sonic user experience and psychophysiology in a first-person shooter. Interacting with Computers, 22(5), 336–343. doi:10.1016/j.intcom.2010.04.005
Posner, J., Russell, J. A., & Peterson, B. S. (2005). The circumplex model of affect: An integrative approach to affective neuroscience, cognitive development, and psychopathology. Development and Psychopathology, 17, 715–734. doi:10.1017/S0954579405050340
Nakamura, J., & Csíkszentmihályi, M. (2002). The concept of flow. In Snyder, C. R., & Lopez, S. J. (Eds.), Handbook of positive psychology (pp. 89–105). New York: Oxford University Press.
Norman, D. A. (2004). Emotional design. New York: Basic Books.
Öhman, A., Flykt, A., & Esteves, F. (2001). Emotion drives attention: Detecting the snake in the grass. Journal of Experimental Psychology: General, 130(3), 466–478. doi:10.1037/0096-3445.130.3.466
Panksepp, J. (2004). Affective neuroscience: The foundations of human and animal emotions. Oxford: Oxford University Press.
PaRappa the rapper. [Video game], (1996). Sony Computer Entertainment.
Przybylski, A. K., Ryan, R. M., & Rigby, S. C. (2009). The motivating role of violence in video games. Personality and Social Psychology Bulletin, 35(2), 243–259. doi:10.1177/0146167208327216
Pulman, A. (2007). Investigating the potential of Nintendo DS Lite handheld gaming consoles and Dr. Kawashima’s Brain Training software as a study support tool in numeracy and mental arithmetic. JISC TechDis HEAT Scheme Round 1 Project Reports. Retrieved June 6, 2009, from http://www.techdis.ac.uk/index.php?p=2_1_7_9.
Quilitch, H. R., & Risley, T. R. (1973). The effects of play materials on social play. Journal of Applied Behavior Analysis, 6(4), 573–578. doi:10.1901/jaba.1973.6-573
Ravaja, N. (2004). Contributions of psychophysiology to media research: Review and recommendations. Media Psychology, 6(2), 193–235. doi:10.1207/s1532785xmep0602_4
Ravaja, N., Turpeinen, M., Saari, T., Puttonen, S., & Keltikangas-Järvinen, L. (2008). The psychophysiology of James Bond: Phasic emotional responses to violent video game events. Emotion (Washington, D.C.), 8(1), 114–120. doi:10.1037/1528-3542.8.1.114
Schachter, S. (1964). The interaction of cognitive and physiological determinants of emotional state. In Berkowitz, L. (Ed.), Advances in experimental social psychology (Vol. 1, pp. 49–80). New York: Academic Press. doi:10.1016/S0065-2601(08)60048-9
Schachter, S., & Singer, J. (1962). Cognitive, social, and physiological determinants of emotional state. Psychological Review, 69, 379–399. doi:10.1037/h0046234
Reiter, U. (2011). Perceived quality in game audio. In Grimshaw, M. (Ed.), Game sound technology and player interaction: Concepts and developments. Hershey, PA: IGI Global.
Schlosberg, H. (1952). The description of facial expressions in terms of two dimensions. Journal of Experimental Psychology, 44(4), 229–237. doi:10.1037/h0055778
Rez. [Video game], (2001). Sega (Developer, Dreamcast), Sony Computer Entertainment Europe (Developer, PlayStation 2).
Seah, M., & Cairns, P. (2008). From immersion to addiction in videogames. In Proceedings of BCS HCI 2008 (pp. 55-63). New York: ACM.
Rhodes, L. A., David, D. C., & Combs, A. L. (1988). Absorption and enjoyment of music. Perceptual and Motor Skills, 66, 737–738.
Shilling, R., Zyda, M., & Wardynski, E. C. (2002). Introducing emotion into military simulation and videogame design: America’s Army: Operations and VIRTE. In Conference GameOn 2002. Retrieved January 1, 2010, from http://gamepipe.usc.edu/~zyda/pubs/ShillingGameon2002.pdf.
Röber, N., & Masuch, M. (2005). Leaving the screen: New perspectives in audio-only gaming. In Proceedings of 11th International Conference on Auditory Display (ICAD).
Rock band. [Video game], (2007). New York: MTV Games.
Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178. doi:10.1037/h0077714
Russell, J. A. (2003). Core affect and the psychological construction of emotion. Psychological Review, 110(1), 145–172. doi:10.1037/0033-295X.110.1.145
Ryan, R., Rigby, C., & Przybylski, A. (2006). The motivational pull of video games: A self-determination theory approach. Motivation and Emotion, 30(4), 344–360. doi:10.1007/s11031-006-9051-8
SimTunes. [Video game], (1996). Maxis (Developer).
SingStar. [Video game], (2004). Sony Computer Entertainment Europe (PlayStation 2 & 3).
Slater, M. (2002). Presence and the sixth sense. Presence (Cambridge, Mass.), 11(4), 435–439. doi:10.1162/105474602760204327
Sweetser, P., & Wyeth, P. (2005). GameFlow: A model for evaluating player enjoyment in games. Computers in Entertainment, 3(3), 3. doi:10.1145/1077246.1077253
Tellegen, A., Watson, D., & Clark, A. L. (1999). On the dimensional and hierarchical structure of affect. Psychological Science, 10(4), 297–303. doi:10.1111/1467-9280.00157
Traxxpad. [Video game], (2007). Eidos Interactive (PlayStation Portable).
von Ahn, L., & Dabbish, L. (2008). Designing games with a purpose. Communications of the ACM, 51(8), 58–67. doi:10.1145/1378704.1378719
Watson, D., & Tellegen, A. (1985). Toward a consensual structure of mood. Psychological Bulletin, 98(2), 219–235. doi:10.1037/0033-2909.98.2.219
Watson, D., Wiese, D., Vaidya, J., & Tellegen, A. (1999). The two general activation systems of affect: Structural findings, evolutionary considerations, and psychobiological evidence. Journal of Personality and Social Psychology, 76(5), 820–838. doi:10.1037/0022-3514.76.5.820
WiiMusic. [Video game], (2008). Kyoto: Nintendo.
Wundt, W. (1896). Grundriss der Psychologie. Leipzig, Germany: Alfred Kröner Verlag.
Zahorik, P., & Jenison, R. L. (1998). Presence as being-in-the-world. Presence (Cambridge, Mass.), 7(1), 78–89. doi:10.1162/105474698565541
ADDITIONAL READING
Brewster, S. A., & Crease, M. G. (1999). Correcting menu usability problems with sound. Behaviour & Information Technology, 18(3), 165–177. doi:10.1080/014492999119066
DeRosa, P. (2007). Tracking player feedback to improve game design. Gamasutra. Retrieved May 21, 2009, from http://www.gamasutra.com/view/feature/1546/tracking_player_feedback_to_.php.
Edworthy, J. (1998). Does sound help us to work better with machines? A commentary on Rauterberg’s paper ‘About the importance of auditory alarms during the operation of a plant simulator’. Interacting with Computers, 10(4), 401–409.
Isbister, K., & Schaffer, N. (2008). Game usability: Advice from the experts for advancing the player experience. Burlington, MA: Morgan Kaufmann Publishers.
James, W. (1994). The physical basis of emotion. Psychological Review, 101(2), 205–210. doi:10.1037/0033-295X.101.2.205
Jenkins, S., Brown, R., & Rutterford, N. (2009). Comparing thermographic, EEG, and subjective measures of affective experience during simulated product interactions. International Journal of Design, 3(2), 53–65.
Kahneman, D. (1973). Attention and effort. Englewood Cliffs, NJ: Prentice-Hall.
Koster, R. (2005). A theory of fun for game design. Phoenix, AZ: Paraglyph Press.
Lang, P. J. (1994). The varieties of emotional experience: A meditation on James-Lange Theory. Psychological Review, 101, 211–221. doi:10.1037/0033-295X.101.2.211
Lazzaro, N. (2003). Why we play: Affect and the fun of games. In Jacko, J. A., & Sears, A. (Eds.), The human-computer interaction handbook: Fundamentals, evolving technologies, and emerging applications (pp. 679–700). New York: Lawrence Erlbaum.
Math