Vision in 3D Environments

Biological and machine systems exist within a complex and changing three-dimensional world. We appear to have no difficulty understanding this world, but how do we go about forming a perceptual model of it? Centered around three key themes – depth processing and stereopsis, motion and navigation in 3D, and natural-scene perception – this volume explores the latest research into the perception of three-dimensional environments. It features contributions from top researchers in the field, presenting both biological and computational perspectives. Topics covered include binocular perception, blur and perceived depth, stereoscopic motion in depth, and perceiving and remembering the shape of visual space. This unique book will provide students and researchers with an overview of ongoing research as well as with perspectives on future developments in the field.

LAURENCE R. HARRIS is Professor of Psychology at York University, Toronto. He is a neuroscientist with a background in sensory processes.

MICHAEL R. M. JENKIN is Professor of Computer Science and Engineering at York University, Toronto. A computer scientist, he works in the area of visually guided autonomous systems.
This volume stems from the biennial York University Centre for Vision Research meetings. The editors have collaborated on a number of edited volumes resulting from these meetings. Previous volumes include:

Cortical Mechanisms of Vision (2009), ISBN 9780521889612
Computational Vision in Neural and Machine Systems (2007), ISBN 9780521862608
Vision and Action (2010; originally published in hardback 1998), ISBN 9780521148399
Supplementary material, including color versions of a selection of the figures, is available online at www.cambridge.org/9781107001756.
Vision in 3D Environments
Edited by
Laurence R. Harris, York University, Toronto, Canada
Michael R. M. Jenkin, York University, Toronto, Canada
Cambridge University Press
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo, Delhi, Tokyo, Mexico City

Cambridge University Press, The Edinburgh Building, Cambridge CB2 8RU, UK
Published in the United States of America by Cambridge University Press, New York

www.cambridge.org
Information on this title: www.cambridge.org/9781107001756

© Cambridge University Press 2011

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2011
Printed in the United Kingdom at the University Press, Cambridge

A catalog record for this publication is available from the British Library

Library of Congress Cataloging in Publication data
Vision in 3D environments / edited by Laurence R. Harris, Michael R. M. Jenkin.
p. cm.
Includes bibliographical references and indexes.
ISBN 978-1-107-00175-6
1. Depth perception. 2. Binocular vision. 3. Human information processing. I. Harris, Laurence, 1953– II. Jenkin, Michael, 1959–
QP487.V57 2011
152.14–dc23
2011012443

ISBN 978-1-107-00175-6 Hardback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.
Contents

List of contributors

1 Seeing in three dimensions
Michael R. M. Jenkin and Laurence R. Harris
1.1 Structure of this volume; References

Part I Depth processing and stereopsis

2 Physiologically based models of binocular depth perception
Ning Qian and Yongjie Li
2.1 Introduction; 2.2 Horizontal disparity and the energy model; 2.3 Disparity attraction and repulsion; 2.4 Vertical disparity and the induced effect; 2.5 Relative versus absolute disparity; 2.6 Phase-shift and position-shift RF models and a coarse-to-fine stereo algorithm; 2.7 Are cells with phase-shift receptive fields lie detectors?; 2.8 Motion–stereo integration; 2.9 Interocular time delay and Pulfrich effects; 2.10 Concluding remarks; Acknowledgments; References

3 The influence of monocular regions on the binocular perception of spatial layout
Barbara Gillam
3.1 Da Vinci stereopsis; 3.2 Monocular-gap stereopsis; 3.3 Phantom stereopsis; 3.4 Ambiguous stereopsis; 3.5 Conclusions; References

4 Information, illusion, and constancy in telestereoscopic viewing
Brian Rogers
4.1 The concept of illusion; 4.2 The telestereoscope; 4.3 Size and disparity scaling; 4.4 Telestereoscopic viewing: two predictions; 4.5 Four experimental questions; 4.6 Methods and procedure; 4.7 The geometry of telestereoscopic viewing; 4.8 Results; 4.9 Summary of results; 4.10 Reconciling the conflicting results; 4.11 Conclusions; References

5 The role of disparity interactions in perception of the 3D environment
Christopher W. Tyler
5.1 Introduction; 5.2 Global interactions; 5.3 Local target structure; 5.4 Psychophysical procedure; 5.5 Position tuning; 5.6 Disparity selectivity of contrast masking; 5.7 Size specificity of disparity masking; 5.8 Relationship of masking to test disparity: absolute or relative?; 5.9 Computational model; 5.10 Polarity specificity of disparity masking; 5.11 The nature of disparity masking; 5.12 Relation to the 3D environment; Acknowledgments; References

6 Blur and perceived depth
Martin S. Banks and Robert T. Held
6.1 Introduction; 6.2 Background; 6.3 Probabilistic modeling of blur as a distance cue; 6.4 Predictions of the model; 6.5 Psychophysical experiment on estimating absolute distance from blur; 6.6 Reconsidering blur as a depth cue; References

7 Neuronal interactions and their role in solving the stereo correspondence problem
Jason M. Samonds and Tai Sing Lee
7.1 Introduction; 7.2 The disparity energy model; 7.3 How to avoid false matches; 7.4 Why do computer vision algorithms perform better?; 7.5 Neurophysiological evidence for spatial interactions; 7.6 Relationship with visual processing of contours and 2D patterns; 7.7 Conclusions; Acknowledgments; References

Part II Motion and navigation in 3D

8 Stereoscopic motion in depth
Robert S. Allison and Ian P. Howard
8.1 Introduction; 8.2 Visual cues to motion in depth; 8.3 Motion in depth from spatially uncorrelated images: effects of velocity and temporal frequency; 8.4 Effects of density; 8.5 Stimulus features; 8.6 Lifetime; 8.7 Segregated stimuli; 8.8 General discussion; Acknowledgments; References

9 Representation of 3D action space during eye and body motion
W. Pieter Medendorp and Stan Van Pelt
9.1 Introduction; 9.2 Sensorimotor transformations; 9.3 Spatial constancy in motor control; 9.4 Quality of spatial constancy; 9.5 Reference frames for spatial constancy; 9.6 Neural mechanisms for spatial constancy across saccades; 9.7 Gaze-centered updating of target depth; 9.8 Spatial-constancy computations during body movements; 9.9 Signals in spatial constancy; 9.10 Conclusions; Acknowledgments; References

10 Binocular motion-in-depth perception: contributions of eye movements and retinal-motion signals
Julie M. Harris and Harold T. Nefs
10.1 Introduction; 10.2 A headcentric framework for motion perception; 10.3 Are retinal and extraretinal sources of motion information separate?; 10.4 Can we understand 3D motion perception as a simple extension of lateral-motion perception?; 10.5 How well do our eyes track a target moving in depth?; 10.6 Are eye movements sufficient for motion-in-depth perception?; 10.7 How does the combination of retinal and extraretinal information affect the speed of motion in depth?; 10.8 Do eye movements affect the perception of aspects of motion other than speed?; 10.9 Comparing Aubert–Fleischl and induced motion effects for lateral motion and motion in depth; 10.10 Summary and conclusions; References

11 A surprising problem in navigation
Yogesh Girdhar and Gregory Dudek
11.1 Introduction; 11.2 Related work; 11.3 Surprise; 11.4 Offline navigation summaries; 11.5 Online navigation summaries; 11.6 Results; 11.7 Conclusions; References

Part III Natural-scene perception

12 Making a scene in the brain
Russell A. Epstein and Sean P. MacEvoy
12.1 Introduction; 12.2 Efficacy of human scene recognition; 12.3 Scene-processing regions of the brain; 12.4 Probing scene representations with fMRI; 12.5 Integrating objects into the scene; 12.6 Conclusions; Acknowledgments; References

13 Surface color perception and light field estimation in 3D scenes
Laurence T. Maloney, Holly E. Gerhard, Huseyin Boyaci, and Katja Doerschner
13.1 The light field; 13.2 Lightness and color perception with changes in orientation; 13.3 Lightness perception with changes in location; 13.4 Representing the light field; 13.5 The psychophysics of the light field; 13.6 Conclusions; Acknowledgments; References

14 Representing, perceiving, and remembering the shape of visual space
Aude Oliva, Soojin Park, and Talia Konkle
14.1 Introduction; 14.2 Representing the shape of a space; 14.3 Perceiving the shape of a space; 14.4 Remembering the shape of a space; 14.5 From views to volume: integrating space; 14.6 Conclusions; Acknowledgments; References

Author Index
Subject Index
Contributors
Robert S. Allison Centre for Vision Research and Department of Computer Science and Engineering York University Toronto, Ontario, Canada
Martin S. Banks Vision Science Program and Bioengineering Graduate Group University of California Berkeley, CA, USA
Huseyin Boyaci Bilkent University Department of Psychology IISBF Building, 343 Ankara, Turkey
Katja Doerschner Bilkent University Department of Psychology IISBF Building, 343 Ankara, Turkey
Gregory Dudek Centre for Intelligent Machines McGill University 3580 University St. Montreal, Quebec, Canada
Russell A. Epstein Center for Cognitive Neuroscience University of Pennsylvania 3720 Walnut St. Philadelphia, PA, USA
Holly E. Gerhard Department of Psychology Center for Neural Science New York University 6 Washington Place, 8th Floor New York, NY, USA
Barbara Gillam School of Psychology University of New South Wales Sydney, NSW, Australia
Yogesh Girdhar Centre for Intelligent Machines McGill University 3580 University St. Montreal, Quebec, Canada
Julie M. Harris University of St. Andrews School of Psychology South Street, St. Andrews Fife, Scotland
Laurence R. Harris Centre for Vision Research and Department of Psychology York University Toronto, Ontario, Canada
Robert T. Held The UC Berkeley–UCSF Graduate Program in Bioengineering University of California Berkeley, CA, USA
Ian P. Howard Centre for Vision Research York University Toronto, Ontario, Canada
Michael R. M. Jenkin Centre for Vision Research and Department of Computer Science and Engineering York University Toronto, Ontario, Canada
Talia Konkle Department of Brain and Cognitive Science Massachusetts Institute of Technology Cambridge, MA, USA
Tai Sing Lee Carnegie Mellon University Center for the Neural Basis of Cognition 4400 Fifth Ave. Pittsburgh, PA, USA
Yongjie Li Department of Neuroscience Columbia University New York, NY, USA
Sean P. MacEvoy Department of Psychology Boston College 140 Commonwealth Ave. Chestnut Hill, MA, USA
Laurence T. Maloney Department of Psychology and Center for Neural Science New York University 6 Washington Pl., 8th Floor New York, NY, USA
W. Pieter Medendorp Donders Institute for Brain, Cognition and Behaviour Radboud University Nijmegen Nijmegen The Netherlands
Harold T. Nefs Delft University of Technology Faculty of EEMSC, Man Machine Interaction/New Delft Experience Lab Mekelweg 4, Delft The Netherlands
Aude Oliva Department of Brain and Cognitive Science Massachusetts Institute of Technology Cambridge, MA, USA
Soojin Park Department of Brain and Cognitive Science Massachusetts Institute of Technology Cambridge, MA, USA
Ning Qian Department of Neuroscience Columbia University New York, NY, USA
Brian Rogers Department of Psychology Oxford University South Parks Road Oxford, UK
Jason M. Samonds Center for Neural Basis of Cognition Carnegie Mellon University 4400 Fifth Ave. Pittsburgh, PA, USA
Christopher W. Tyler The Smith-Kettlewell Eye Research Institute 2318 Fillmore St. San Francisco, CA, USA
Stan Van Pelt Donders Institute for Brain, Cognition and Behaviour Radboud University Nijmegen Nijmegen The Netherlands
1 Seeing in three dimensions
Michael R. M. Jenkin and Laurence R. Harris
Seeing in 3D is a fundamental problem for any organism or device that has to operate in the real world. Answering questions such as "how far away is that?" or "can we fit through that opening?" requires perceiving and making judgments about the size of objects in three dimensions. So how do we see in three dimensions?

Given a sufficiently accurate model of the world and its illumination, complex but accurate models exist for generating the pattern of illumination that will strike the retina or cameras of an active agent (see Foley et al., 1995). The inverse problem, how to build a three-dimensional representation from such two-dimensional patterns of light impinging on our retinas or the cameras of a robot, is considerably more complex. In fact, the problem of perceiving 3D shape and layout is a classic example of an ill-posed and underconstrained inverse problem. It is an underconstrained problem because a unique solution is not obtainable from the visual input. Even when two views are present (with the slightly differing viewpoints of each eye), the images do not necessarily contain all the information required to reconstruct the three-dimensional structure of a viewed scene. It is an ill-posed problem because small changes in the input can lead to significant changes in the output: that is, reconstruction is very vulnerable to noise in the input signal. Constructing the three-dimensional structure of the viewed scene is thus an extremely difficult and usually impossible problem to solve uniquely. Seeing in three dimensions relies on using a range of assumptions and memories to guide the choice of solution when no unique solution is possible.
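A toy numerical sketch of why this inverse problem is underconstrained (our own illustration, not code from the volume): under a simple pinhole-projection model, a scene and a uniformly scaled copy of it produce exactly the same image, so the image by itself cannot fix absolute size or distance.

```python
import numpy as np

def project(points_3d, focal=1.0):
    """Pinhole projection of 3D points (x, y, z) onto a 2D image plane."""
    pts = np.asarray(points_3d, dtype=float)
    return focal * pts[:, :2] / pts[:, 2:3]

# A toy "scene": the corners of a small square one metre from the eye ...
scene = np.array([[-0.1, -0.1, 1.0],
                  [ 0.1, -0.1, 1.0],
                  [ 0.1,  0.1, 1.0],
                  [-0.1,  0.1, 1.0]])

# ... and the same square made ten times larger and placed ten times farther away.
scaled_scene = 10.0 * scene

# The two scenes yield identical images, so no image-based algorithm can
# tell them apart without extra assumptions or additional cues.
print(np.allclose(project(scene), project(scaled_scene)))  # True
```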
Humans and seeing machines rely on assumptions such as smoothness, continuity, the expected illumination direction, and the like to help constrain the problem. But even with these extrasensory aids, constructing three-dimensional layout is an easy problem to get wrong. Errors can lead to potentially catastrophic results for a biological or machine system attempting the task. Introductory psychology textbooks are littered with amusing illusions that illustrate how difficult and ambiguous it is to perceive 3D environments from specially constructed 2D illustrations. The Ames room (see Chapter 4), for example, illustrates how a properly contrived 3D image (the room must be viewed with one eye from a carefully determined viewpoint) can lead to a gross misperception of its three-dimensional structure.

However, it is not necessary to construct such contrived environments, and one does not have to go to the ends of the Earth to find three-dimensional environments that are challenging to perceive correctly. Almost any environment that falls outside of our normal perceptual experience (i.e., for which our assumptions are unreliable) can lead to perceptual errors. For example, Figure 1.1a shows a view of Ollantaytambo, in Peru. Ollantaytambo is located about nine thousand feet above sea level and at one point served as the capital of the Inca empire. This is a natural mountain that has been modified for human use. One would imagine that this type of scene should be sufficiently natural that it would be relatively straightforward to see – or to construct an accurate internal three-dimensional representation of the contents of the scene. Assuming that this is so, how large are the horizontal steps carved into the mountainside? Figure 1.1b shows an alternate view of the large flat "steps" at Ollantaytambo. The steps are actually terraced gardens, and the true scale of the site is perhaps made more clear by the smaller but still quite large steps in the foreground and by the tourists visible in the scene. Without accurate cues to scale, it can be extremely difficult to see accurately in three dimensions.

Figure 1.1 It can be very difficult to perceive structures properly in 3D. (a) shows part of the ruins of Ollantaytambo, located in southern Peru. Judging the scale of the structure can be quite difficult. (b) shows a closeup of the steps. Tourists in the scene provide a better estimate of the scale of the steps seen in (a). Photos by one of the authors (MJ).

The human visual system relies on a wide variety of different perceptual cues to provide information about the three-dimensional structure of our world. Figure 1.2 illustrates a subset of the perceptual cues available when one is perceiving three-dimensional structure. Most of these cues provide useful information at a particular range of distances. For example, atmospheric perspective (where things appear dimmer when viewed through a larger amount of atmosphere) is useful only at very large distances, whereas stereoscopic cues are most useful at close range when disparities between the images of the two eyes are largest. The operating range over which a selection of these cues are useful is shown in Figure 1.3. A large literature exists concerning the mechanisms for processing each of the cues identified in these figures, along with how they
can be used in various situations. Classic investigations can be found in the motion-related work of Wallach and O'Connell (1953) and the brilliant insights of Helmholtz (1910/1925). Wheatstone (1838) and Julesz (1971) provided the solid basis of our understanding of the processing of binocular-disparity cues, and Gibson (1950) and Kaufman (1974) provided in-depth reviews of the use of monocular cues in the perception of 3D structure. A decomposition of the literature into work on individual cues has permeated both the biological and the computational literature. Indeed, for some period of time the computational literature was very much focused on shape from X problems, where X corresponded to one of the boxes in the hierarchy.

Figure 1.2 Some cues to 3D structure. A nonexhaustive but hopefully illustrative set of cues that can provide 3D information. [The figure diagrams a hierarchy of sources of 3D information: oculomotor cues (convergence, accommodation, myosis) and visual cues; visual cues divide into binocular and monocular cues, and monocular cues into motion-based and static cues, including deletion/accretion, changing size, motion parallax, occlusion, shading, texture gradient, relative height, relative size, atmospheric perspective, linear perspective, shadows, and familiarity.]

Each individual cue provides a very underconstrained solution for inferring three-dimensional structure. An example is provided by the perspective cues shown in Figure 1.4. The lines on the pavement in this square in Portugal radiate from a central point. But when viewed from some angles (Figure 1.4b), they appear as parallel. The convergence of
truly parallel lines is often given as an example of a perspective cue from which depth can be judged. However, this example shows that what is perceived as "parallel" is not trustworthy. As a consequence of ambiguities of this kind, a wide range of prior assumptions (which might themselves be faulty, as this example shows) must be made when interpreting each cue. Another example of an assumption built into the interpretation of depth cues can be found when one is perceiving shape from shading. In the absence of a clear source of illumination, light is assumed to come from "above" (Mamassian and Goutcher, 2001; Ramachandran, 1988). A flat surface shaded from light to dark from top to bottom thus appears convex, while the same surface when presented the other way up appears concave.

Figure 1.3 The just-discriminable depth threshold (detectable difference in depth as a fraction of viewing distance) for information provided by various cues is plotted here as a (somewhat idealized) function of distance from the observer in meters. The area under each curve indicates when that information source is suprathreshold. Although some cues, such as occlusion, relative density, and relative size, are useful at all distances, others are only useful over a narrow range of distances. The typical distance to the horizon for an Earth-based observer is also shown. Diagram redrawn from Cutting and Vishton (1995). [The plotted cues include occlusion, height in the visual field, relative size, relative density, aerial perspective, motion perspective, and binocular disparity with convergence and accommodation, over personal space, action space, and vista space, for distances from about 1 m to 10 000 m.]

However, it is now becoming clear that the perception of 3D information involves the integration of cues from several of the sources listed in Figures 1.2 and 1.3. Interactions between the perceptual cues can occur at various levels. Thus, although some chapters in this volume consider specific static cues to the generation of 3D information in isolation (e.g., Chapters 6, 12, and 13), others
consider specific cue interactions. For example, Chapter 3 explores interactions between monocular, so-called pictorial, and binocular visual cues. Seeing in three dimensions is thus the result of the integration of a large number of different cues, each of which is predicated upon its own set of assumptions and constraints. Many of these cues are not available independently, but rather must be combined in order to be properly interpreted.
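One standard way to formalize such cue combination – a textbook maximum-likelihood scheme rather than a model proposed in this chapter – is to weight each cue's depth estimate by its reliability (inverse variance). A minimal sketch with made-up numbers:

```python
import numpy as np

def combine_depth_cues(estimates, variances):
    """Reliability-weighted (inverse-variance) combination of depth estimates.

    estimates : depth suggested by each cue (metres)
    variances : uncertainty (variance) of each cue at the current distance
    Returns the combined estimate and its variance.
    """
    estimates = np.asarray(estimates, dtype=float)
    inv_var = 1.0 / np.asarray(variances, dtype=float)
    weights = inv_var / inv_var.sum()
    return float(weights @ estimates), float(1.0 / inv_var.sum())

# Hypothetical numbers: at close range binocular disparity is precise,
# while a pictorial cue such as relative size is much noisier.
depth, var = combine_depth_cues(estimates=[1.00, 1.20],   # disparity, relative size
                                variances=[0.01, 0.25])
print(f"combined depth = {depth:.2f} m, variance = {var:.3f}")
# Prints roughly 1.01 m: the estimate is pulled toward the more reliable cue.
```

Cues whose operating range (Figure 1.3) excludes the current viewing distance would simply contribute near-zero weight in such a scheme.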
1.1 Structure of this volume
This volume is organized around three themes related to the perception of three-dimensional space. Part I looks at depth processing and stereopsis. Chapter 2 considers how the underlying physiology constrains the computational process of binocular combination. Chapter 3 considers how monocular information influences binocular perception. Chapter 4 explores how stereo image processing takes place in telestereoscopes. It is often assumed that local disparity computation is performed in isolation. Chapter 5 looks at interactions between disparity computations. Chapter 6 considers the influence of blur on
the perception of depth. Chapter 7 examines how local interactions between neuronal units aid in the processing of disparity information.

The ability to obtain and maintain a three-dimensional spatial representation is a key requirement of mobility in three dimensions. Part II looks at motion and navigation in three dimensions. Chapter 8 looks at the problem of stereo information processing in a temporal field. Chapter 9 examines the representation of space during eye and body motions. Chapter 10 looks at the interaction between eye movement and binocular motion-in-depth perception. Chapter 11 applies the problem of maintaining a three-dimensional representation of space to the problem of robot navigation.

Lastly, Part III of this volume considers the problem of perceiving natural scenes. Chapter 12 presents functional magnetic resonance imaging (fMRI) work examining how humans construct a three-dimensional scene in the brain. Chapter 13 looks at the relationship between color and lightness perception. Finally, Chapter 14 considers how we perceive, represent, and maintain an internal representation of the shape of visual space.

Figure 1.4 Photographs of markings in a square in Vila Viçosa, Portugal. In the square, there are lines radiating out from a statue (a). When the lines are viewed from the statue they appear to be parallel, even though logically it is apparent that the distance between the lines in the distance must be much further than the distance between them close to the observer. Perspective-based cues to depth can be highly misleading. Photos by one of the authors (LRH).
References
Cutting, J. E. and Vishton, P. M. (1995). Perceiving layout and knowing distances: the interaction, relative potency, and contextual use of different information about depth. In W. Epstein and S. Rogers (eds.), Perception of Space and Motion, pp. 69–117. San Diego, CA: Academic Press.
Foley, J. D., van Dam, A., Feiner, S. K., and Hughes, J. F. (1995). Computer Graphics: Principles and Practice. Second Edition in C. New York: Addison-Wesley.
Gibson, J. J. (1950). The Perception of the Visual World. Boston, MA: Houghton Mifflin.
Helmholtz, H. von (1910/1925). Helmholtz' Treatise on Physiological Optics. Translation of the 3rd edition (1910). New York: Dover.
Julesz, B. (1971). Foundations of Cyclopean Perception. Chicago, IL: University of Chicago Press.
Kaufman, L. (1974). Sight and Mind. New York: Oxford University Press.
Mamassian, P. and Goutcher, R. (2001). Prior knowledge on the illumination position. Cognition, 81: B1–B9.
Ramachandran, V. S. (1988). The perception of shape from shading. Nature, 331: 163–166.
Todd, J. T. (2004). The visual perception of 3D shape. Trends Cogn. Sci., 8: 115–121.
Wallach, H. and O'Connell, D. N. (1953). The kinetic depth effect. J. Exp. Psychol., 45: 205–217.
Wheatstone, C. (1838). On some remarkable, and hitherto unobserved, phenomena of binocular vision. Phil. Trans. R. Soc. Lond., 128: 371–394.
PART I DEPTH PROCESSING AND STEREOPSIS
2 Physiologically based models of binocular depth perception
Ning Qian and Yongjie Li
2.1 Introduction
We perceive the world as three-dimensional. The inputs to our visual system, however, are only a pair of two-dimensional projections on the two retinal surfaces. As emphasized by Marr and Poggio (1976), it is generally impossible to uniquely determine the three-dimensional world from its two-dimensional retinal projections. How, then, do we usually perceive a well-defined three-dimensional environment? It has long been recognized that, since the world we live in is not random, the visual system has evolved and developed to take advantage of the world's statistical regularities, which are reflected in the retinal images. Some of these image regularities, termed depth cues, are interpreted by the visual system as depth. Numerous depth cues have been discovered. Many of them, such as perspective, shading, texture, motion, and occlusion, are present in the retina of a single eye, and are thus called monocular depth cues. Other cues are called binocular, as they result from comparing the two retinal projections.

In the following, we will review our physiologically based models for three binocular depth cues: horizontal disparity (Qian, 1994; Chen and Qian, 2004), vertical disparity (Matthews et al., 2003), and interocular time delay (Qian and Andersen, 1994; Qian and Freeman, 2009). We have also constructed a model for depth perception from monocularly occluded regions (Assee and Qian, 2007), another binocular depth cue, but have omitted it here owing to space limitations.
2.2 Horizontal disparity and the energy model
The strongest binocular depth cue is the horizontal component of the binocular disparity, defined as the positional difference between the two retinal projections of a given point in space (Figure 2.1). It has long been recognized that the brain uses the horizontal disparity to estimate the relative depths of objects in the world with respect to the fixation point, a process known as stereovision or stereopsis (Howard, 2002).

Figure 2.1 The geometry of binocular projection (a) and the definition of binocular disparity (b). For simplicity, we consider only the plane of regard defined by the instantaneous fixation point (F) and the optical centers (ol and or) of the two eyes (i.e., the points in the eyes' optical system through which the light rays can be assumed to pass in straight lines). The two foveas (fl and fr) are considered as corresponding to each other and thus have zero disparity. To make clear the positional relationship between other locations on the two retinas, one can imagine superimposing the two retinas with the foveas aligned (bottom). The fixation point F in space projects approximately to the two corresponding foveas (fl and fr), with a near-zero disparity. The disparity of any other point in space can then be defined as φ1 − φ2, which is equal to ψ2 − ψ1. It then follows that all zero-disparity points in the plane fall on the so-called Vieth–Müller circle passing through the fixation point and the two optical centers, since all circumference angles corresponding to the same arc (ol or) are equal. Other points in the plane do not project to corresponding locations on the two retinas, and thus have nonzero disparities. Each circle passing through the two optical centers defines a set of isodisparity points. When the fixation distance is much larger than the interocular separation and the gaze direction is not very eccentric, the constant-disparity surfaces can be approximated by frontoparallel planes.

With retinal positions expressed as visual angles, the horizontal disparity H for an arbitrary point P in Figure 2.1a is defined as φ1 − φ2, which is equal to ψ2 − ψ1 based on geometry. From further geometrical considerations, it can be shown that if the eyes are fixating at a point F at a distance Z at a given instant, then the horizontal disparity H of a nearby point P at a distance Z + ΔZ is approximately given by

H ≈ a ΔZ / Z²    (2.1)
where a is the interocular separation, and H is measured in radians of visual angle.¹ The approximation is good provided that the spatial separation between the two points is small compared with Z. The inverse square relationship in Eq. (2.1) can be easily understood. φ1 and φ2 are the visual angles spanned by the separation PF at the two eyes, and are thus inversely proportional to the fixation distance Z plus higher-order terms. Since H = φ1 − φ2, the 1/Z term is canceled by the subtraction, and the next most important term is thus proportional to 1/Z².

¹ The disparity of the fixation point itself is usually very small (McKee and Levi, 1987; Howard, 2002) and can be assumed to be zero when it is not the subject of study.

Because simple geometry provides relative depth given retinal disparity, one of the main problems of stereovision is how the brain measures disparity from the two retinal images in the first place. Many algorithms for disparity computation have been proposed. Most of them, however, have emphasized the ecological, mathematical, or engineering aspects of the problem, while often ignoring relevant neural mechanisms. For example, a whole class of models is based on Marr and Poggio's (1976) approach of starting with all possible matches between the features (such as dots or edges) in the two half-images of a stereogram and then introducing constraints to eliminate the false matches and compute the disparity map. These models literally assume that there are cells that respond to only a specific match and nothing else. In reality, even the most sharply tuned binocular cells respond to a range of disparities (Nikara et al., 1968; Maske et al., 1984; Bishop and Pettigrew, 1986; Poggio and Fischer, 1977; Poggio and Poggio, 1984). If these models are revised to use realistic, distributed disparity representation, then it is not known how to implement the constraints needed for disparity computation (Assee and Qian, 2007). The style of disparity computation in the brain seems to be fundamentally different from these models (Qian, 1997). In an effort to address this shortcoming, we have constructed physiologically based algorithms for disparity computation according to the quantitative properties of binocular cells in the visual cortex reported by Ohzawa and coworkers
(Ohzawa et al., 1996; Freeman and Ohzawa, 1990; DeAngelis et al., 1991; Ohzawa et al., 1990, 1997). These investigators mapped binocular receptive fields (RFs) of primary visual cortical cells in detail and proposed a model for describing their responses.

Figure 2.2 Schematic drawings illustrating the shift between the left and right receptive fields (RFs) of binocular simple cells. The "+" and "−" signs represent the ON and OFF subregions, respectively, within the RFs. Two different models for achieving the shift have been suggested by physiological experiments. (a) Position-shift model. According to this model, the left and right RFs of a simple cell have identical shapes but have an overall horizontal shift between them (Bishop and Pettigrew, 1986). (b) Phase-shift model. This model assumes that the shift is between ON/OFF subregions within the left and right RF envelopes that spatially align (Ohzawa et al., 1990, 1996; DeAngelis et al., 1991). The fovea locations on the left and right retinas are drawn as a reference point for vertically aligning the left and right RFs of a simple cell. Modified from Figure 2 of Qian (1997).

Let us first consider simple cells. Two different models for describing binocular simple-cell RFs have been proposed. Early physiological studies suggested that there is an overall positional shift between the left and right RFs of binocular simple cells (Bishop and Pettigrew, 1986). The shapes of the two RF profiles of a given cell were assumed to be identical (Figure 2.2a). In contrast, later quantitative studies by Ohzawa et al. (1990) have found that the left and right RF profiles of a simple cell often possess different shapes. These authors accounted for this finding by assuming that the RF shift is between ON/OFF
subregions within identical RF envelopes that spatially align (Figure 2.2b). This new RF model is often referred to as the phase-shift, phase-difference, or phase-parameter model. For ease of description, the shift, expressed in terms of the visual angle, in both of these alternatives will be referred to as the "RF shift" (Figure 2.2) when it is not essential to distinguish between them. Later, we will discuss important differences between these two RF models.

Figure 2.2 only shows the ON and OFF subregions of the RFs schematically. As is well known, the details of the RF profiles of simple cells can be described by the Gabor function, which is a Gaussian envelope multiplied by a sinusoidal modulation (Marčelja, 1980; Daugman, 1985; McLean and Palmer, 1989; Ohzawa et al., 1990; DeAngelis et al., 1991; Ohzawa et al., 1996; Anzai et al., 1999b). The Gaussian envelope determines the overall dimensions and location of the RF, while the sinusoidal modulation determines the ON and OFF subregions within the envelope.

Since disparity itself is a shift between the two retinal projections (Figure 2.1), one might expect that a binocular simple cell would give the best response when the stimulus disparity happens to match the cell's left–right RF shift. In other words, a simple cell might prefer a disparity equal to its RF shift. A population of such cells with different shifts would then prefer different disparities, and the unknown disparity of any stimulus could be computed by identifying which cell gives the strongest response to the stimulus. The reason that no stereo algorithm has come out of these considerations is because the very first assumption – that a binocular simple cell has a preferred disparity equal to its RF shift – is not always valid; it is only true for simple patterns (such as bars or gratings) undergoing coherent motion, and not for any static patterns, nor for moving or dynamic stereograms with complex spatial profiles (such as random-dot patterns) (Qian, 1994; Chen et al., 2001). Simple cells cannot generally have a well-defined preferred disparity, because their responses depend not only on the disparity but also on the detailed spatial structure of the stimulus (Ohzawa et al., 1990; Qian, 1994; Zhu and Qian, 1996; Qian and Zhu, 1997). Although one can measure a disparity tuning curve from a simple cell, the location of the peak of the curve (i.e., the preferred disparity) changes with some simple manipulations (such as a lateral displacement) of the stimuli. This property is formally known as Fourier phase dependence, because the spatial structure of an image is reflected in the phase of its Fourier transform. Because of the phase dependence, simple-cell responses cannot explain the fact that we can detect disparities in static stereograms and in complex dynamic stereograms.

The phase dependence of simple-cell responses can be understood intuitively by considering the disparity tuning of a simple cell to a static vertical line. The
Fourier phase of the line is directly related to the lateral position of the line, which will affect where its projection falls in the left and right RFs of the simple cell. A line with a given disparity may evoke a strong response at one lateral position because it happens to project onto the excitatory subregions of both the left and the right RFs, but may evoke a much weaker response at a different lateral position because it now stimulates some inhibitory portions of the RFs. Therefore, the response of the simple cell to a fixed disparity changes with changes in the Fourier phases of the stimulus and, consequently, it cannot have a well-defined preferred disparity. There is direct experimental evidence supporting this conclusion. For example, Ohzawa et al. (1990) found that the disparity tuning curves of simple cells measured with bright bars and dark bars (whose Fourier phases differ by π) were very different. The Fourier phase dependence of simple-cell responses can also explain an observation by Poggio et al. (1985), who reported that simple cells show no disparity tuning to dynamic random-dot stereograms. Each of the stereograms in their experiment maintained a constant disparity over time, but its Fourier phase was varied from frame to frame by constantly rearranging the dots. Simple cells lost their disparity tuning as a result of averaging over many different (phase-dependent) tuning curves (Qian, 1994; Chen et al., 2001).

While simple cells are not generally suited for disparity computation, owing to their phase dependence, the responses of complex cells do have the desired phase independence, as expected from their lack of separate ON and OFF subregions within their RFs (Skottun et al., 1991). To build a working stereo algorithm, however, one needs to specify how this phase independence is achieved and how an unknown stimulus disparity can be recovered from these responses. Fortunately, a model for describing the responses of binocular complex cells has been proposed by Ohzawa and coworkers based on their quantitative physiological studies (Ohzawa et al., 1990, 1997; Anzai et al., 1999c). The model is known as the disparity energy model, since it is a binocular extension of the well-known motion energy model (Adelson and Bergen, 1985; Watson and Ahumada, 1985). Ohzawa et al. (1990) found that a binocular complex cell in the cat primary visual cortex can be simulated by summing the squared responses of a quadrature pair of simple cells, and the simple-cell responses, in turn, can be simulated by adding the visual inputs from their left and right RFs (see Figure 2.3). (Two binocular simple cells are said to form a quadrature pair if there is a quarter-cycle shift between the ON/OFF subregions of their left and right RFs (Ohzawa et al., 1990; Qian, 1994).)

Figure 2.3 The model proposed by Ohzawa et al. (1990) for describing the response of binocular complex cells. The complex cell (labeled C in the figure) sums the squared outputs of a quadrature pair of simple cells (labeled S1 and S2). Each simple cell, in turn, sums the contributions from its two RFs on the left and right retinas. The left RF of S2 differs from the left RF of S1 by a quarter-cycle shift. Likewise, the two right RFs also differ by a quarter-cycle shift. Several mathematically equivalent variations of the model are discussed in the text. Reproduced from Figure 5 of Qian (1997).

The remaining questions are whether the model complex cells constructed in this way are indeed independent of the Fourier phases of the stimulus and, if so, how their preferred disparities are related to their RF parameters. We have
investigated these issues through mathematical analyses and computer simulations (Qian, 1994; Zhu and Qian, 1996; Qian and Zhu, 1997). The complex-cell model was found to be independent of the Fourier phases of the stimulus for simple stimuli, including the bars used in the physiological experiments of Ohzawa et al. (1990), and its preferred disparity was approximately equal to the left–right RF shift within the constituent simple cells. For more complicated stimuli such as random-dot stereograms, however, a complex cell constructed from a single quadrature pair of simple cells is still phase-sensitive, albeit less so than simple cells. This problem can be easily solved by considering the additional physiological fact that complex cells have somewhat larger RFs than those
of simple cells (Hubel and Wiesel, 1962). We incorporated this fact into the model by spatially pooling several quadrature pairs of simple cells with nearby and overlapping RFs to construct a model complex cell (Zhu and Qian, 1996; Qian and Zhu, 1997). The resulting complex cell was largely phase-independent for any stimulus, and its preferred disparity was still approximately equal to the RF shift within the constituent simple cells.

With the above method for constructing reliable complex-cell responses, and the relationship derived by that method between the preferred disparity and the RF parameters, we were finally ready to develop, for the first time, a stereo algorithm for solving stereograms using physiological properties of binocular cells (Qian, 1994; Zhu and Qian, 1996; Qian and Zhu, 1997). By using a population of complex cells tuned to the same preferred spatial frequency and with their preferred disparities covering the range of interest, the disparity of an input stimulus could be determined by identifying the cell in the population with the strongest response (or by calculating the population-averaged preferred disparity of all cells weighted by their responses). An example of the application of this algorithm to a random-dot stereogram is shown in Figure 2.4.

Figure 2.4 A random-dot stereogram (a) and the computed disparity map (b). The stereogram has 110 × 110 pixels with a dot density of 50%. The central 50 × 50 area and the surrounding area have disparities of 2 and −2 pixels, respectively. When fused with uncrossed eyes, the central square appears further away than the surround. The disparity map of the stereogram was computed with eight complex cells (with the same spatial scale but different preferred disparities) at each location. The distance between two adjacent sampling lines represents a distance of two pixel spacings in the stereogram. Negative and positive values indicate near and far disparities, respectively. The disparity map can be improved by combining information across different scales (Chen and Qian, 2004). Modified from Figures 4 and 8 of Qian and Zhu (1997).

A mathematical analysis of these model complex cells reveals that their computation is formally equivalent to summing two related cross-products of the band-pass-filtered left and right image patches (Qian and Zhu, 1997). This operation is essentially an efficient version of cross-correlation (Qian and Zhu, 1997; Qian and Mikaelian, 2000). Since the disparity is a shift between two retinal projections, it is certainly reasonable to use a cross-correlation-like operation to compute it. Qian and Mikaelian (2000) also compared this energy-based algorithm with the so-called phase algorithm in computer vision (Sanger, 1988; Fleet et al., 1991) (which should not be confused with the phase-shift RF model).

It has been demonstrated experimentally that complex cells receive monosynaptic inputs from simple cells but not vice versa (Alonso and Martinez, 1998), as required by the model. On the other hand, there is, as yet, no direct anatomical evidence supporting the quadrature pair method for constructing binocular complex cells from simple cells. However, based on the quantitative physiological work of Ohzawa and coworkers (DeAngelis et al., 1991; Ohzawa et al., 1990, 1996, 1997), the method is at least valid as a phenomenological description of a subset of real complex-cell responses. In addition, our analyses indicate that the same phase-independent complex-cell responses can be obtained by combining the outputs of many simple cells to average out their phase sensitivity, without requiring the specific quadrature relationship (Qian, 1994; Qian and Andersen, 1997; Qian and Mikaelian, 2000). Two other variations of the model also lead to the same complex-cell responses. The first considers the fact that cells cannot fire negatively. Therefore, each simple cell in Figure 2.3 should be split into a push–pull pair with inverted RF profiles, so that they can carry the positive and negative portions of the original responses without using negative firing rates (Ohzawa et al., 1990). In the second variation, the squaring operation in Figure 2.3 is considered to occur at the stage of simple-cell responses and the
complex cell simply sums the simple-cell responses (Heeger, 1992; Anzai et al., 1999b,c; Chen et al., 2001).

Although the disparity energy model was originally proposed based on data from cats (Ohzawa et al., 1990), later studies indicate that the same approach can be used to describe the responses of monkey binocular cells as well (Livingstone and Tsao, 1999; Cumming and DeAngelis, 2001). One difference, though, is that monkeys have a much smaller fraction of simple cells than cats do; most monkey V1 cells appear to be complex. The energy model, however, requires that there be more simple cells than complex cells. This difficulty could be alleviated by assuming that, for many complex cells in monkeys, a stage similar to the simple-cell responses happens in the dendritic compartments of complex cells. In other words, simple-cell-like properties could be constructed directly from inputs from the lateral geniculate nucleus to a dendritic region of a complex cell. The simple-cell-like responses from different regions of the dendritic tree are then pooled in the cell body to give rise to complex-cell response properties. This scheme is also consistent with the observation that some complex cells seem to receive direct inputs from the lateral geniculate nucleus (Alonso and Martinez, 1998).
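To make the computation described in this section concrete, the sketch below (our own illustrative code, not the authors' implementation) builds a one-dimensional disparity energy unit: left and right Gabor RFs related by a phase shift, a quadrature pair of binocular simple cells whose squared outputs are summed, spatial pooling over nearby RF centers, and a winner-take-all readout over a small population of preferred disparities. All parameter values and the sign convention for the phase shift are arbitrary choices for the demo.

```python
import numpy as np

def gabor(x, sigma, freq, phase):
    """1D Gabor receptive field: Gaussian envelope times a sinusoid."""
    return np.exp(-x**2 / (2 * sigma**2)) * np.cos(2 * np.pi * freq * x + phase)

def energy_response(left_img, right_img, center, pref_disp, sigma=4.0, freq=0.125):
    """Disparity-energy response of one model complex cell at one location.

    The preferred disparity is encoded as an interocular phase shift of the
    right RF (phase-shift model); the squared outputs of a quadrature pair
    of binocular simple cells are summed.
    """
    x = np.arange(len(left_img)) - center
    dphi = -2 * np.pi * freq * pref_disp        # phase shift encoding pref_disp
    energy = 0.0
    for phase in (0.0, np.pi / 2):              # quadrature pair of simple cells
        s = gabor(x, sigma, freq, phase) @ left_img \
            + gabor(x, sigma, freq, phase + dphi) @ right_img
        energy += s**2
    return energy

# Toy random-dot row whose right-eye view is shifted by 2 pixels.
rng = np.random.default_rng(1)
left = rng.standard_normal(64)
right = np.roll(left, 2)

disparities = np.arange(-3, 4)
# Pool responses over nearby RF centers, as in the spatial pooling described above.
pooled = [sum(energy_response(left, right, c, d) for c in range(24, 41, 4))
          for d in disparities]
print("disparity with strongest pooled response:",
      disparities[int(np.argmax(pooled))])   # expected to peak at or near 2
```

A population-averaged readout, weighting each cell's preferred disparity by its response, could replace the argmax here, as noted above.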
2.3 Disparity attraction and repulsion
After demonstrating that our physiologically based method could effectively extract binocular-disparity maps from stereograms, we then applied the model to account for some interesting perceptual properties of stereopsis. For example, the model can explain the observation that we can still perceive depth when the contrasts of the two images in a stereogram are different, so long as they have the same sign, and the reliability of depth perception decreases with the contrast difference (Qian, 1994; Smallman and McKee, 1995; Qian and Zhu, 1997; Qian and Mikaelian, 2000). We also applied the model to a psychophysically observed depth illusion reported by Westheimer (1986) and Westheimer and Levi (1987). These authors found that when a few isolated features with different disparities are viewed foveally, the perceived disparity between them is smaller (attraction) or larger (repulsion) than the actual value, depending on whether their lateral separation is smaller or larger than several minutes of arc. If the separation is very large, there is no interaction between the features. We showed that these effects are a natural consequence of our disparity model (Mikaelian and Qian, 2000). The interaction between the features in the model originates from their simultaneous presence in the cells’ RFs, and by pooling across cells tuned to different frequencies and orientations, the psychophysical
results can be explained without introducing any ad hoc assumptions about the connectivity of the cells (Lehky and Sejnowski, 1990).
2.4 Vertical disparity and the induced effect
We have focused so far on the computation of horizontal disparity – the primary cue for stereoscopic depth perception. It has been known since the time of Helmholtz that vertical disparities between the two retinal images can also generate depth perception (Howard, 2002). The mechanism involved, however, is more controversial. The best-known example of depth perception from vertical disparity is perhaps the so-called induced effect (Ogle, 1950): a stereogram made from two identical images but with one of them slightly magnified vertically (Figure 2.5a) is perceived as a slanted surface rotated about a vertical axis (Figure 2.5b). The surface appears further away on the side with the smaller image, and the apparent axis of rotation is the vertical meridian through the point of fixation (Ogle, 1950; Westheimer and Pettet, 1992).

Figure 2.5 (a) A schematic stereogram for the induced effect (Ogle, 1950). The left eye's view (L) is magnified vertically with respect to the right eye's view (R). (b) With a stereogram like that in (a), a slanted surface is perceived, shown schematically in the top view, as if the right image had been magnified horizontally (Ogle, 1950). Reproduced from Figure 1 of Matthews et al. (2003).

Figure 2.6 The signs of the disparity and depth for (a) the induced effect (vertical disparity) and (b) the geometric effect (horizontal disparity). For clarity, the features in the left and right images are represented schematically by filled and open dots, respectively. In each panel, the fixation point is at the center of the cross, which divides the space into four quadrants. The arrows indicate the signs of the disparity in the four quadrants caused by (a) a vertical magnification in the left eye and (b) a horizontal magnification in the right eye. The sign of the perceived depth (near or far) in each quadrant is also indicated. Note that the depth sign of the vertical disparity is quadrant-dependent (Westheimer and Pettet, 1992), while that of the horizontal disparity is not. Reproduced from Figure 2 of Matthews et al. (2003).

To better appreciate this phenomenon, we indicate in Figure 2.6a the signs of the depth and disparity in the four quadrants around the point of fixation for the specific case of a left-image magnification. The features in the left image (filled dots) are then outside the corresponding features in the right image (open dots), as shown. The perceived slant is such that the first and fourth quadrants appear far and the second and third quadrants appear near with respect to the fixation point. It then follows that
the opposite vertical-disparity signs in the first and fourth quadrants generate the same depth sign (far), and that the same vertical-disparity signs in the first and second quadrants generate opposite depth signs (far and near, respectively). In other words, the depth sign of a given vertical disparity depends on the quadrants around the fixation point (Westheimer and Pettet, 1992). To generate the same kind of surface slant with horizontal disparity (termed the “geometric effect” by Ogle (1950)), one would have to magnify the right image horizontally. Unlike the case for the vertical disparity, however, the depth sign of the horizontal disparity is fixed and independent of the quadrant (Figure 2.6b). These and other considerations have led to the widely accepted notion that the role of vertical disparity is fundamentally different from that of horizontal disparity. In particular, since the vertical disparity is large at large retinal or gaze eccentricity and does not have a consistent local depth sign, and since the effect of vertical disparity can be best demonstrated with large stimuli (Rogers and Bradshaw, 1993; Howard and Kaneko, 1994) and appears to be averaged over greater areas than that of horizontal disparity (Kaneko and Howard, 1997), it is believed that the effect of vertical disparity is global, while the effect of horizontal disparity is local. Numerous theories of vertical disparity have been proposed (Ogle, 1950; Koenderink and van Doorn, 1976; Arditi et al., 1981; Mayhew and Longuet-Higgins, 1982; Gillam and Lawergren, 1983; Rogers and
Bradshaw, 1993; Howard and Kaneko, 1994; Liu et al., 1994; Gårding et al., 1995; Banks and Backus, 1998; Backus et al., 1999); many of them employ some form of global assumption to explain the induced effect. For example, Mayhew and Longuet-Higgins (Mayhew, 1982; Mayhew and Longuet-Higgins, 1982) proposed that the unequal vertical image sizes in the two eyes are used to estimate two key parameters of the viewing system: the absolute fixation distance and the gaze angle. Since the horizontal disparity is dependent on these parameters, the estimated parameters will modify the interpretation of horizontal disparity globally, and hence the global depth effect of vertical disparity.

There are, however, several challenges to this theory. First, the predicted depth-scaling effect of vertical disparity cannot be observed with display sizes ranging from 11° (Cumming et al., 1991) to 30° (Sobel and Collett, 1991). The common argument that these displays are simply not large enough is unsatisfactory because the induced effect can be perceived with these display sizes. Furthermore, even with stimuli as large as 75°, the observed scaling effect is much weaker than the prediction (Rogers and Bradshaw, 1993). Second, the predicted gaze-angle shift caused by vertical magnification is never perceived, and additional assumptions are needed to explain this problem (Bishop, 1996). Third, to account for the results under certain stimulus conditions, the theory has to assume that multiple sets of viewing-system parameters are used by the visual system at the same time, an unlikely event (Rogers and Koenderink, 1986).

A general problem applicable to all purely global interpretations of vertical disparity, including the theory of Mayhew and Longuet-Higgins, is that vertical disparity can generate reliable (albeit relatively weak) local depths even in small displays that are viewed foveally (Westheimer, 1984; Westheimer and Pettet, 1992; Matthews et al., 2003). One might argue that, functionally, the depth effect of vertical disparity in small displays is not as important as the induced effect in the case of large stimuli because the vertical disparity is usually negligible near the fovea, while full-field vertical size differences between the eyes can occur naturally with eccentric gaze. However, as pointed out by Farell (1998), the vertical disparity can be quite large even near the fovea when oriented contours in depth are viewed through narrow vertical apertures. This situation is illustrated in Figure 2.7a. When the apertures are narrow enough, the horizontal disparity is largely eliminated and subjects have to rely on vertical disparity to make local depth judgments.

We have proposed a new theory for depth perception from vertical disparity (Matthews et al., 2003) based on the oriented binocular RFs of visual cortical cells (Ohzawa et al., 1990, 1996, 1997; DeAngelis et al., 1991; Anzai et al., 1999b,c) and on the radial bias of the preferred-orientation distribution in the cortex (Bauer et al., 1983; Leventhal, 1983; Bauer and Dow, 1989; Vidyasagar and Henry, 1990).
Figure 2.7 (a) An illustration of how vertical disparity can arise from horizontal disparity carried by oriented contours (Farell, 1998). The vertical occluders have zero disparity, while the diagonal line has a far horizontal disparity between its left (L) and right (R) images. The visible segments between the occluders have disparities mainly in the vertical dimension. (b) An analogous illustration of how interocular time delay can arise from horizontal disparity carried by moving targets (i.e., oriented contours in the spatiotemporal space) (Burr and Ross, 1979). The moving dot in the figure has a far horizontal disparity, but when viewed through the apertures between the occluders, it appears at the same spatial locations (i.e., the locations of the apertures) but at different times. If the y axis in (a) represents time, then (a) is the spatiotemporal representation of (b). Reproduced from Figure 4 of Matthews et al. (2003).
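The geometry in Figure 2.7a can be checked with a few lines of code. The sketch below is a minimal illustration written for this discussion (the slope, intercept, and disparity values are arbitrary choices, not taken from any experiment): a diagonal line carrying a purely horizontal disparity H is sampled at vertical apertures that themselves have zero disparity, and the interocular difference of the visible segments turns out to be purely vertical, equal to −mH for a line of slope m. Dividing by −m recovers the line's original horizontal disparity, anticipating Eq. (2.2) below.

```python
# Hypothetical numerical check of the Figure 2.7a geometry (all values arbitrary).
import numpy as np

m = 1.0          # slope of the diagonal line, i.e. tan(theta)
c = 0.0          # intercept of the line in the left eye's image
H = 0.3          # horizontal disparity of the line (right minus left image position)

# Left-eye image of the line: y = m*x + c
# Right-eye image: the same line shifted horizontally by H: y = m*(x - H) + c
aperture_x = np.array([-1.0, 0.0, 1.0])   # columns of narrow vertical apertures (zero disparity)

y_left = m * aperture_x + c
y_right = m * (aperture_x - H) + c

V = y_right - y_left                         # interocular difference at each aperture
print("vertical disparity at each aperture:", V)              # equals -m*H everywhere
print("equivalent horizontal disparity -V/tan(theta):", -V / m)  # recovers H
```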
It can be shown within the framework of the disparity energy method that cells with preferred horizontal and vertical spatial frequencies ωx0 and ωy0 (and thus the same preferred orientation θ) may treat a vertical disparity V in the stimulus as an equivalent horizontal disparity given by (Matthews et al., 2003)

$$H_{\mathrm{equiv}} = \frac{\omega_{y0}}{\omega_{x0}}\, V = -\frac{V}{\tan\theta}. \tag{2.2}$$
The second equality holds because tan θ = −ωy0/ωx0 when θ is measured counterclockwise from the positive horizontal axis. (The negative sign is needed because when tan θ is positive, as in Figure 2.8, ωx0 and ωy0 have opposite signs according to the formal conventions of the Fourier transform.)

Figure 2.8 A geometric explanation of Eq. (2.2). (a) Parallel lines are drawn through the boundaries of the ON and OFF subregions of an RF profile. The horizontal and vertical distances between these lines are approximately equal to half of the preferred horizontal spatial period and half of the preferred vertical spatial period, respectively, of the cell. (b) If the left and right RFs have a vertical shift V, an equivalent horizontal shift of Hequiv is introduced.

Figure 2.8 provides an intuitive explanation of why oriented cells may treat a vertical disparity as an equivalent horizontal disparity. An orientation-tuned cell with a vertical offset between its left and right RFs can be approximately viewed as having an equivalent horizontal offset instead. Therefore, the cell may treat a vertical disparity in the stimulus as an equivalent horizontal disparity because, most of the time, horizontal disparity is more significant than vertical disparity owing to the horizontal separation of the eyes. To determine the equivalent horizontal disparity, note that the horizontal and vertical distances between the two adjacent parallel lines marking the ON and OFF subregions of the RFs are approximately equal to half of the preferred horizontal spatial period λx and half of the preferred vertical spatial period λy, respectively, of the cell (Figure 2.8). Now suppose there is a vertical shift of V between the left and right RFs (Figure 2.8b). It is obvious that the equivalent horizontal shift is given by

$$H_{\mathrm{equiv}} = \frac{\lambda_x}{\lambda_y}\, V = \frac{\omega_{y0}}{\omega_{x0}}\, V = -\frac{V}{\tan\theta},$$
which is the same as Eq. (2.2). The second equality holds because spatial periods are inversely related to the corresponding spatial frequencies. The negative sign in Eq. (2.2) is a consequence of the fact that we define both positive horizontal and positive vertical disparities in the same way (for example, as the right image position minus the left image position). For the oriented RFs shown in Figure 2.8, a positive V must lead to a negative Hequiv . How can Eq. (2.2) account for the perceived depth in stereograms containing vertical disparities? According to Eq. (2.2), cells with a preferred orientation θ would treat a vertical disparity V as an equivalent horizontal disparity
(−V/tan θ). For stimuli without a dominant orientation, such as random textures, cells tuned to all orientations, with both positive and negative signs of tan θ, will be activated. These cells will report equivalent horizontal disparities of different signs and magnitudes, and the average result across all cells should be near zero. The only possibility of seeing depth from vertical disparity in stimuli without a dominant orientation arises when certain orientations are overrepresented by the cells in the visual cortex and, consequently, their equivalent horizontal disparities are not completely averaged out after pooling across cells tuned to all orientations. On the other hand, if the stimuli do have a strong orientation θs, the cells with preferred orientation θ = θs will be maximally activated and the equivalent horizontal disparity they report should survive orientation pooling. Therefore, depth perception from vertical disparity should be most effective for stimuli with a strong orientation, but will usually be less effective than horizontal disparity (Westheimer, 1984), since most stimuli will activate cells tuned to different orientations, and pooling across orientations will make the equivalent horizontal disparities weaker. A near-vertical orientation of the stimulus, however, will not easily allow cortical cells to convert a vertical disparity into an equivalent horizontal disparity, because vertically tuned cells have ωy0 = 0 in Eq. (2.2). Similarly, a near-horizontal orientation will not be effective either, since the equivalent horizontal disparity will be too large (owing to the vanishing of ωx0) to be detected (unless V approaches zero). Therefore, the theory predicts that the best orientation of a stimulus for perceiving depth from vertical disparity should be around a diagonal axis.

A critical test of our theory is whether it can explain the well-known induced effect (Ogle, 1950): a stereogram made from two identical images but with one of them slightly magnified vertically is perceived as a surface rotated about the vertical axis going through the point of fixation (Figure 2.9a). First note that the induced effect can be observed in stimuli having no dominant orientation, such as random textures (Ogle, 1950). Therefore, according to the above discussion, a reliable equivalent horizontal disparity could be generated only by an overrepresentation of certain orientations in the brain. Remarkably, physiological experiments have established well a radial bias of preferred orientations around the fixation point in the cat primary visual cortex (Leventhal, 1983; Vidyasagar and Henry, 1990) and in the supragranular layers of the monkey area V1 (Bauer et al., 1983; Bauer and Dow, 1989). (The supragranular layers are known to project to higher visual cortical areas (Felleman and Van Essen, 1991), and are thus more likely to be relevant than the other layers for perception.) That is, although the full range of orientations is represented for every spatial location, the orientation connecting each location and the fixation point is overrepresented at that location (Figure 2.9b).
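The orientation-pooling argument can be made concrete with a small simulation. The sketch below is a toy illustration, not the model code of Matthews et al. (2003): a fixed vertical disparity is converted to −V/tan θ for a population of preferred orientations, and the contributions are averaged with weights standing in for how strongly each orientation is driven by the stimulus; contributions from near-horizontal cells, whose equivalent disparities diverge, are clipped at an assumed detection limit, in the spirit of the argument above. With uniform weights (an isotropic stimulus) the signed contributions nearly cancel, whereas weights concentrated around a diagonal stimulus orientation leave a consistent equivalent horizontal disparity.

```python
import numpy as np

V = 0.1                                    # vertical disparity in the stimulus (arbitrary units)
theta = np.deg2rad(np.arange(5, 180, 5))   # preferred orientations of the cell population
h_equiv = -V / np.tan(theta)               # Eq. (2.2) for each preferred orientation

limit = 2.0                                # assumed detection limit for huge equivalent disparities
h_equiv = np.clip(h_equiv, -limit, limit)

def pooled_disparity(weights):
    """Response-weighted average of the equivalent horizontal disparities."""
    return np.sum(weights * h_equiv) / np.sum(weights)

# Isotropic stimulus: all orientations equally active, so signed contributions cancel.
w_iso = np.ones_like(theta)

# Diagonally oriented stimulus: activity concentrated around the stimulus orientation.
theta_s = np.deg2rad(45)
w_oriented = np.exp(-((theta - theta_s) ** 2) / (2 * np.deg2rad(10) ** 2))

print("pooled H_equiv, isotropic stimulus:", pooled_disparity(w_iso))      # ~ 0
print("pooled H_equiv, 45-deg stimulus  :", pooled_disparity(w_oriented))  # ~ -V/tan(45 deg)
```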
Figure 2.9 Our explanation for the induced effect and the related quadrant dependence of vertical disparity. (a) The signs of the point disparity and depth in the four quadrants around the fixation point caused by a magnification of the left image (as in Figure 2.6a). Features in the left and right images are represented by filled and open dots, respectively. The signs of the vertical disparities are indicated by arrows, and the depth signs are marked as “near” or “far”. (b) The radial bias (dashed lines) of the preferred orientations around the fixation point (central cross) found in the visual cortex. For example, the 45° orientation and the vertical orientation are overrepresented for spatial locations 1 and 2, respectively. (c) Conversion of vertical disparity into equivalent horizontal disparity by the overrepresented cortical cells in the four quadrants. The four vertical-disparity arrows are copied from (a), and the four horizontal arrows indicate the signs of the equivalent horizontal disparities according to the overrepresented orientations (dashed lines) and Eq. (2.2). Reproduced from Figure 6 of Matthews et al. (2003).
This is precisely what is needed for explaining the induced effect and the related quadrant dependence of the vertical disparity for stimuli without a dominant orientation (Figure 2.9c). To be more quantitative, let the fixation point be the origin and assume that the left image is magnified vertically by a factor of k (> 1). Then, the vertical disparity at the stimulus location (x, y) is V(x, y) = (k − 1)y. The radial bias means that the cortically overrepresented orientation for the location (x, y) is given by tan θ = y/x. Then, according to Eq. (2.2), the corresponding equivalent horizontal disparity should be

$$H_{\mathrm{equiv}}(x, y) = -\frac{(k - 1)y}{\tan\theta} = -(k - 1)x. \tag{2.3}$$
Therefore, although the vertical magnification of the left image by a factor of k creates a vertical disparity of (k − 1)y at the location (x, y), the overrepresented equivalent horizontal disparity is −(k − 1)x, and could be mimicked by magnifying the right image horizontally by a factor of k. The perceived surface should thus be rotated around the vertical axis going through the fixation point, which is consistent with psychophysical observations (Ogle, 1950). Note that the radial bias does not affect the depth perceived from real horizontal
disparity, since, unlike vertical disparity, horizontal disparity is not subject to an orientation-dependent conversion.

We mentioned that the quadrant dependence of the vertical disparity means that the vertical disparity does not have a consistent local depth sign, and this may seem to imply that the induced effect can be explained only by global considerations. However, we have shown above that our local theory can account for the phenomenon very well through an orientation-dependent conversion of vertical disparity into an equivalent horizontal disparity.

Our theory is consistent with the finding that vertical disparity is more effective at larger display sizes (Rogers and Bradshaw, 1993; Howard and Kaneko, 1994) and with the related observation that vertical disparity appears to operate at a more global scale than horizontal disparity (Kaneko and Howard, 1997). This is because the radial bias of cells’ preferred orientations is stronger at higher eccentricities (Leventhal, 1983), although the bias is also present for foveal cells in monkey area V1 (Bauer et al., 1983; Bauer and Dow, 1989). Larger displays cover more eccentric locations, and are therefore more effective. For small displays, the effect of vertical disparity is harder to observe because of the weaker radial orientation bias in the brain; however, our theory predicts that the effect can be made stronger by using a near-diagonal orientation of the stimulus.

Our theory predicts further that when there is both horizontal and vertical disparity, the total horizontal disparity should be equal to the actual horizontal disparity plus the equivalent horizontal disparity generated by the vertical disparity. Therefore, these two types of disparity should locally enhance or cancel each other depending on their depth signs. We have tested and confirmed these predictions using diagonally oriented stimuli (Matthews et al., 2003).

Our theory also makes specific physiological predictions. First, there should be a population of V1 cells that shows both disparity tuning and orientation bias, and the bias should be stronger at greater eccentricity. Second, V1 cells’ responses to a given vertical disparity should depend on their preferred orientation. These predictions have been confirmed in a subsequent physiological study by Durand et al. (2006), who concluded that “our results directly demonstrate both assumptions of this model.”
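Equations (2.2) and (2.3) are simple enough to verify directly. The following sketch (an illustrative computation with an arbitrary magnification factor) applies the conversion at a grid of locations away from the fixation point, using the radially biased orientation tan θ = y/x at each location, and confirms that the resulting equivalent horizontal disparity field is −(k − 1)x, i.e., a field that depends on x only, as expected for a surface slanted about the vertical axis through fixation.

```python
import numpy as np

k = 1.05                                    # vertical magnification of the left image
# Grid of stimulus locations; the fixation point (origin) itself is excluded
# because tan(theta) = y/x is undefined there.
xs = np.array([-10.0, -5.0, 5.0, 10.0])
ys = np.array([-10.0, -5.0, 5.0, 10.0])
X, Y = np.meshgrid(xs, ys)

V = (k - 1) * Y                             # vertical disparity created by the magnification
tan_theta = Y / X                           # radial bias: overrepresented orientation at (x, y)
H_equiv = -V / tan_theta                    # Eq. (2.2) applied with the overrepresented orientation

print(np.allclose(H_equiv, -(k - 1) * X))   # True: Eq. (2.3) holds at every location
```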
2.5 Relative versus absolute disparity
The disparity defined in Figure 2.1 – the positional difference between the left and right retinal projections of a point in space – is more precisely called the absolute disparity. The difference between the absolute disparities of two points is termed the relative disparity between those points. Since the fixation
point disparity is usually very small and stable (McKee and Levi, 1987; Howard and Rogers, 1995), the absolute disparity of a point is approximately equal to the relative disparity between that point and the fixation point. It is therefore difficult to distinguish between the two types of disparity under most normal viewing conditions, where many points with different disparities are present and one of the points is the fixation point at any given instant. One might hope to create a situation without relative disparity and with only absolute disparity by presenting a stimulus with a single disparity. However, under this condition, the stimulus will trigger a vergence eye movement, which quickly reduces the absolute disparity to near zero. In the laboratory, it is possible to use a feedback loop to maintain a constant absolute disparity (Rashbass and Westheimer, 1961). With such a procedure, it has been shown that V1 cells encode absolute disparity (Cumming and Parker, 1999, 2000). Since binocular depth perception is known to rely mainly on relative disparity (Westheimer, 1979; Howard and Rogers, 1995), it is thus possible that a higher visual cortical area converts absolute disparity into relative disparity through simple subtraction (Cumming and DeAngelis, 2001; Neri et al., 2004; Umeda et al., 2007). Although we have constructed our models based on V1 RF properties, we do not infer that binocular depth perception necessarily happens in V1; later stages may have similar RF properties, or may simply inherit and refine V1 responses to generate perception.

On the other hand, it is neither economical nor necessary for the brain to encode relative disparity across the entire binocular visual field. Assume that the brain has computed absolute disparities at N points in a scene. Since there are N(N − 1) ordered pairs of the N points, a much greater amount of resources would be required for the brain to convert and store all the N(N − 1) relative-disparity values. An alternative possibility is that the brain might simply use absolute disparity across the whole field as an implicit representation of the relative disparity, and compute the relative disparity explicitly only for the pair of points under attentional comparison at any given time. The fact that depth perception from a single absolute disparity is poor may be a simple reflection of poor depth judgment from vergence.

One might argue that a relative-disparity map is more economical because, unlike absolute disparity, it does not change with vergence and thus does not have to be recomputed with each vergence eye movement. However, since saccades and head/body movements are frequent, and the world is usually not static, the brain has to recompute the disparity map frequently anyway. Also, the fact that V1 encodes absolute disparity suggests that it might be too difficult to compute relative disparity directly without computing absolute disparity first.
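The bookkeeping behind this argument is easy to state in code. The sketch below is purely illustrative (random numbers stand in for measured absolute disparities): storing every ordered pair's relative disparity requires N(N − 1) values, whereas keeping the N absolute disparities and subtracting on demand for the pair currently under attentional comparison requires only N values and one subtraction per comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
absolute = rng.normal(scale=0.2, size=N)      # absolute disparities of N points (arbitrary units)

# An explicit relative-disparity map would need N*(N-1) ordered-pair values.
print("ordered pairs to store:", N * (N - 1))  # 999000 values, versus only N = 1000 absolute values

def relative_disparity(i, j):
    """Compute the relative disparity of points i and j on demand (simple subtraction)."""
    return absolute[i] - absolute[j]

print("relative disparity of points 3 and 17:", relative_disparity(3, 17))
```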
2.6 Phase-shift and position-shift RF models and a coarse-to-fine stereo algorithm
We mentioned earlier that two different models for binocular simple-cell RFs have been proposed: the position-shift and phase-shift models (Figure 2.2). Much of what we have discussed above applies to both RF models. However, there are also important differences between them (Zhu and Qian, 1996; Qian and Mikaelian, 2000; Chen and Qian, 2004). For example, we have analyzed disparity tuning to sinusoidal gratings and broadband noise (such as random-dot stereograms) for the position- and phase-shift models (see Eqs. (2.11)–(2.15) in Zhu and Qian (1996) and related work in Fleet et al. (1996)). For a complex cell with a phase shift φ between the left and right RFs and a preferred spatial frequency ω0, its peak response to noise occurs at the preferred disparity

$$D^{\mathrm{phs}}_{\mathrm{noise}} = \frac{\phi}{\omega_0}. \tag{2.4}$$

Around this disparity, one peak in the periodic response to a sinusoidal grating with spatial frequency ω occurs at

$$D^{\mathrm{phs}}_{\sin} = \frac{\phi}{\omega} = \frac{\omega_0}{\omega}\, D^{\mathrm{phs}}_{\mathrm{noise}}. \tag{2.5}$$

In contrast, for a cell with a positional shift d, these peaks are all aligned at d, the cell’s preferred disparity:

$$D^{\mathrm{pos}} = d. \tag{2.6}$$
Therefore, near a cell’s preferred disparity for noise stimuli, the preferred disparity for sinusoidal gratings depends on the spatial frequency of the grating for phase-shift RFs but not for position-shift RFs (Zhu and Qian, 1996). Such a dependence has been observed in the visual Wulst of the barn owl (Wagner and Frost, 1993), supporting the phase-shift model originally proposed for the cat V1 (Ohzawa et al., 1990). On the other hand, the preferred disparity of phase-shift cells is limited to plus or minus half of the preferred spatial period of the cells (Blake and Wilson, 1991; Freeman and Ohzawa, 1990; Qian, 1994; Smallman and MacLeod, 1994), and some real cells in the barn owl do not follow this constraint strictly (Zhu and Qian, 1996). It thus appears that both the phase- and the position-shift RF mechanisms are used to code disparity. Later physiological experiments on cats and monkeys have confirmed that a mixture of the two RF models is the best description of the binocular cells in these species (Anzai et al., 1997, 1999a; Cumming and DeAngelis, 2001).
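The frequency dependence summarized in Eqs. (2.4)–(2.6) can be reproduced with a bare-bones, one-dimensional energy computation. The sketch below is a simplified stand-in written for illustration, not the simulation code of Zhu and Qian (1996): each model cell is a single quadrature pair with Gabor RFs, and its response to gratings of two frequencies is scanned over stimulus disparity. The peak for the phase-shift cell moves with grating frequency (its magnitude follows φ/ω; the sign depends on the disparity and phase conventions adopted), while the peak for the position-shift cell stays at the position offset d.

```python
import numpy as np

x = np.linspace(-60, 60, 2401)               # 1-D retinal axis (arbitrary units)
sigma, w0 = 8.0, 2 * np.pi / 16              # Gabor envelope width and preferred spatial frequency

def gabor(center, phase):
    """Gabor receptive field centred at `center` with carrier phase `phase`."""
    return np.exp(-(x - center)**2 / (2 * sigma**2)) * np.cos(w0 * (x - center) + phase)

def energy(stim_freq, D, phase_shift=0.0, pos_shift=0.0):
    """Energy response (one quadrature pair) to a grating of frequency `stim_freq`
    presented with stimulus disparity D (right image = left image shifted by D)."""
    left_img, right_img = np.cos(stim_freq * x), np.cos(stim_freq * (x - D))
    E = 0.0
    for q in (0.0, np.pi / 2):               # quadrature pair of binocular simple cells
        s = (np.sum(gabor(0.0, q) * left_img)
             + np.sum(gabor(pos_shift, q + phase_shift) * right_img))
        E += s**2
    return E

disparities = np.linspace(-8, 8, 321)
for freq in (w0, 1.25 * w0):                 # two grating frequencies
    # Phase-shift cell (phi = pi/2): peak magnitude follows phi/freq (Eq. 2.5).
    phase_peak = disparities[np.argmax([energy(freq, D, phase_shift=np.pi / 2) for D in disparities])]
    # Position-shift cell (d = 3): peak stays at d regardless of frequency (Eq. 2.6).
    pos_peak = disparities[np.argmax([energy(freq, D, pos_shift=3.0) for D in disparities])]
    print(f"grating freq {freq:.3f}: phase-shift peak {phase_peak:+.2f} "
          f"(phi/freq = {np.pi / 2 / freq:.2f}); position-shift peak {pos_peak:+.2f}")
```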
The above discussion of the two RF models prompted Chen and Qian (2004) to ask “what are the relative strengths and weaknesses of the phase- and the position-shift mechanisms in disparity computation, and what is the advantage, if any, of having both mechanisms?” With appropriate parameters, either type of RF model (or a hybrid of them) can be used as a front-end filter in the energy method for disparity computation described earlier (Qian and Zhu, 1997). However, our analysis and our simulations over a much wider range of parameters reveal some interesting differences between the two RF models in terms of disparity computation (Chen and Qian, 2004). The main finding is that the phase-shift RF model is, in general, more reliable (i.e., less variable) than the position-shift RF model for disparity computation. The accuracy of the computed disparity is very good for both RF models at small disparity, but it deteriorates at large disparity. In particular, the phase-shift model tends to underestimate the magnitude of the disparity owing to a zero-disparity bias (Qian and Zhu, 1997). Additionally, both RF models are only capable of dealing well with disparity within plus or minus half of the preferred spatial period of the cells. This was known earlier for the phase-shift model (see above). It turns out that the position-shift model has a similar limitation: although position-shift cells can have large preferred disparities, the responses of a population of them for disparity computation often have false peaks at large preferred disparities (Chen and Qian, 2004).

These results and the physiological data of Menz and Freeman (2003) suggest a coarse-to-fine stereo algorithm that takes advantage of both the phase-shift and the position-shift mechanisms (Chen and Qian, 2004). In this algorithm, disparity computation is always performed by the phase-shift mechanism because of its higher reliability over the entire disparity range. Since the phase-shift model is accurate only when the disparity is small, the algorithm iteratively reduces the magnitude of the disparity through a set of spatial scales by introducing a constant position-shift component for all cells to offset the stimulus disparity. Specifically, for a given stereogram, a rough disparity map is first computed with the phase-shift model at a coarse scale using the energy method (Qian, 1994). The computed disparity at each spatial position is then used as a constant position-shift component for all cells at the next, finer scale. At the next scale, different cells all have the same position-shift component but different phase-shift components so that the disparity computation is still done by the reliable phase-shift mechanism. The amount of disparity that the phase-shift component has to deal with, however, has been reduced by the common position-shift component of all cells, and the new disparity estimated from the phase-shift component will thus be more accurate. The process can be repeated across several scales. We have implemented such a coarse-to-fine algorithm and
found that it can indeed greatly improve the quality of computed disparity maps (Chen and Qian, 2004). This coarse-to-fine algorithm is similar in spirit to the one originally proposed by Marr and Poggio (1979), but with two major differences. First, we have used the position-shift component of the RFs to reduce the magnitude of the disparity at each location, while Marr and Poggio (1979) used vergence eye movement, which changes the disparity globally. Second, at each scale we have used the energy method for disparity computation, while Marr and Poggio (1979) used a nonphysiological, feature-matching procedure.
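The following sketch conveys the flavor of such a coarse-to-fine scheme on a single scanline. It is a simplified stand-in rather than the implementation of Chen and Qian (2004): the population of phase-shift energy units at each scale is replaced by a single complex-Gabor phase-difference estimate, and the disparity computed at the previous scale plays the role of the common position-shift component by warping the right signal before the finer scale is applied. The half-period limit of the phase mechanism discussed above shows up here as the wrap-around range of the phase difference.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2048
xs = np.arange(n, dtype=float)
left = np.convolve(rng.normal(size=n), np.ones(5) / 5, mode="same")   # 1-D random texture
true_disp = 6.0                                                        # constant disparity (pixels)
right = np.interp(xs - true_disp, xs, left)                            # right signal = shifted left signal

def phase_disparity(a, b, period, sigma):
    """Phase-difference disparity estimate at one scale (a stand-in for phase-shift energy units)."""
    k = np.arange(-4 * sigma, 4 * sigma + 1)
    kernel = np.exp(-k**2 / (2 * sigma**2)) * np.exp(1j * 2 * np.pi * k / period)
    ca = np.convolve(a, kernel, mode="same")
    cb = np.convolve(b, kernel, mode="same")
    return np.angle(ca * np.conj(cb)) * period / (2 * np.pi)   # valid only within +/- period/2

estimate = np.zeros(n)
for period in (64, 32, 16, 8):                                  # coarse-to-fine scales
    # The current estimate acts like a common position-shift component: it is used to
    # warp the right signal so the residual disparity handled by the phase mechanism is small.
    warped = np.interp(xs + estimate, xs, right)
    estimate += phase_disparity(left, warped, period, sigma=period / 2)

print("true disparity:", true_disp, " mean recovered:", round(float(estimate[200:-200].mean()), 2))
```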
2.7 Are cells with phase-shift receptive fields lie detectors?
Read and Cumming (2007) asked the same question of why there are both phase- and position-shift RF mechanisms in the brain, but reached a different conclusion. They argued that cells with position-shift RFs code real, physical disparities while those with phase-shift RFs code impossible, nonphysical disparities and are thus “lie detectors.” In particular, they believe that cells with phase-shift RFs “respond optimally to [impossible] stimuli in which the left and right eye’s images are related by a constant shift in Fourier phase.” It is not clear how they reached this conclusion. The phase-shift model assumes that the sinusoids of the Gabor functions for the left and right RFs have a phase shift; mathematically, however, this is not equivalent to a constant phase shift of the RFs’ Fourier components. Read and Cumming (2007) defined an impossible stimulus as a visual input that “never occurs naturally, ... even though it can be simulated in the laboratory.” They considered a cell as coding impossible stimuli, and thus as a lie detector, if the cell responds better or shows greater response modulation to impossible stimuli than to naturally occurring stimuli (see also Haefner and Cumming, 2008). Unfortunately, this definition is problematic because, according to it, nearly all visual cells should be classified as lie detectors coding impossible stimuli. To begin with, most visual cells have retinally based RFs. To stimulate these cells optimally, the stimulus has to match the retinal location and size of the RFs. This means that the stimulus has to move with the eyes, have the right size, and be placed at the right location and distance from the eyes. Such stimuli never happen naturally. We therefore conclude that the notion of dividing cells into those coding physical and those coding impossible stimuli is not compelling. Visual cells generally respond better to artificial stimuli tailored to match their RF properties than to naturally occurring stimuli. That does not mean that they are designed to code impossible stimuli. Read and Cumming (2007) also disputed Chen and Qian’s (2004) conclusion that phase-shift cells are more reliable than position-shift cells for disparity
computation. They correctly pointed out that the distribution of the computed disparity depends on whether the stimulus disparity is introduced symmetrically or asymmetrically between the two eyes. However, our recent simulations (Yongjie Li, Yuzhi Chen, and Ning Qian, unpublished observations) show that, regardless of the symmetry, the disparity distribution computed with position-shift cells always has more outliers, and consequently a much larger standard deviation, than has the distribution computed using the phase-shift RF model. We thus maintain our conclusion that phase-shift cells are more reliable than position-shift cells for disparity computation. Read and Cumming (2007) emphasized that the population response curve for the position-shift model is symmetric when the stimulus disparity is introduced symmetrically. However, this is only true for stimuli containing a single, uniform disparity and is thus not useful for general disparity computation.

Finally, Read and Cumming (2007) proposed a new algorithm for disparity computation. A close examination reveals that this algorithm and the earlier algorithm of Chen and Qian (2004) search for the same goal in a space covered by cells with various combinations of phase shifts and position shifts, but with different search strategies. The common goal is a set of cells all having the same position-shift component, equal to the stimulus disparity, and whose phase-shift component encodes zero disparity. When multiple scales are considered, Chen and Qian’s (2004) coarse-to-fine algorithm is more efficient as it involves only a single disparity computation with phase-shift cells at each scale, while Read and Cumming’s (2007) algorithm involves multiple disparity computations, also with phase-shift cells, at each scale. Interestingly, Read and Cumming’s (2007) algorithm employs far more phase-shift-based computation than position-shift-based computation and thus also takes advantage of the better reliability of the phase-shift RF mechanism.
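The shared search space can be pictured with a small schematic. The sketch below uses the approximation, consistent with Eqs. (2.4) and (2.6), that a hybrid cell with position shift d and phase shift φ prefers a disparity of roughly d + φ/ω0 for broadband stimuli (this combination rule is an assumption of the sketch, not a formula quoted from either paper); it then marks all (d, φ) combinations consistent with a given stimulus disparity and the particular "goal" cells described above, those whose position shift equals the stimulus disparity and whose phase shift is zero.

```python
import numpy as np

w0 = 2 * np.pi / 16                             # preferred spatial frequency of the cells
D_stim = 5.0                                    # stimulus disparity to be explained

pos_shifts = np.linspace(-8, 8, 17)             # grid of position-shift components
phase_shifts = np.linspace(-np.pi, np.pi, 33)   # grid of phase-shift components
d, phi = np.meshgrid(pos_shifts, phase_shifts)

preferred = d + phi / w0                        # assumed preferred disparity of a hybrid cell

# All (d, phi) combinations whose preferred disparity matches the stimulus disparity:
matching = np.isclose(preferred, D_stim, atol=0.25)
# The "goal" cells: position shift equal to the stimulus disparity, zero phase shift.
goal = np.isclose(d, D_stim) & np.isclose(phi, 0.0)

print("cells consistent with the stimulus disparity:", int(matching.sum()))
print("goal cells (d = D_stim, phi = 0):", int(goal.sum()))
```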
2.8 Motion–stereo integration
There is increasing psychophysical and physiological evidence indicating that motion detection and stereoscopic depth perception are processed together in the brain (Regan and Beverley, 1973; Nawrot and Blake, 1989; Qian et al., 1994a; Maunsell and Van Essen, 1983; Bradley et al., 1995; Ohzawa et al., 1996). We have demonstrated that, under physiologically plausible assumptions about the spatiotemporal properties of binocular cells, the stereo energy model reviewed above can be naturally combined with the motion energy model (Adelson and Bergen, 1985; Watson and Ahumada, 1985) to achieve motion–stereo integration (Qian and Andersen, 1997). The cells in the model are tuned to both motion and disparity just like physiologically observed cells, and a
population of complex cells covering a range of motion and a range of disparity combinatorially could simultaneously compute the motion and disparity of a stimulus. Interestingly, the complex cells in the integrated model are much more sensitive to motion along constant-disparity planes than to motion in depth towards or away from the observer because the left and right RFs of a cell have the same spatiotemporal orientation (Qian, 1994; Ohzawa et al., 1996, 1997; Qian and Andersen, 1997; Chen et al., 2001). This property is consistent with the physiological finding that few cells in the visual cortex are truly tuned to motion in depth (Maunsell and Van Essen, 1983; Ohzawa et al., 1996, 1997) and with the psychophysical observation that human subjects are poor at detecting motion in depth based on disparity cues alone (Westheimer, 1990; Cumming and Parker, 1994; Harris et al., 1998). Because of this property, motion information could help reduce the number of possible stereoscopic matches in an ambiguous stereogram by making stereo matches in frontoparallel planes more perceptually prominent than matches of motion in depth.

The integrated model has also been used to explain the additional psychophysical observation that adding binocular-disparity cues to a stimulus can help improve the perception of multiple and overlapping motion fields in the stimulus (i.e., motion transparency) (Qian et al., 1994b). In this explanation, it is assumed that transparent motion is usually harder to perceive than unidirectional motion because, in area MT, motion signals from different directions suppress each other (Snowden et al., 1991; Qian and Andersen, 1994). The facilitation of transparent-motion perception by disparity can then be accounted for by assuming that the suppression in area MT is relieved when the motion signals from different directions are in different disparity planes (Qian et al., 1994a,b). This prediction of disparity-gated motion suppression in area MT has subsequently been verified physiologically (Bradley et al., 1995). Finally, the integrated motion–stereo model has allowed us to explain many temporal aspects of disparity tuning (Chen et al., 2001).
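The combinatorial-coverage idea can be illustrated with a toy population readout. The sketch below is schematic only (Gaussian joint tuning and a winner-take-all readout are arbitrary simplifications, not the model of Qian and Andersen (1997)): a grid of units, each preferring one velocity and one disparity, responds to a stimulus, and the location of the population peak reports both quantities at once.

```python
import numpy as np

velocities = np.linspace(-8, 8, 33)        # preferred velocities of the population
disparities = np.linspace(-2, 2, 21)       # preferred disparities of the population
V, D = np.meshgrid(velocities, disparities)

def population_response(v_stim, d_stim, sigma_v=2.0, sigma_d=0.5):
    """Toy joint tuning: each unit prefers one (velocity, disparity) combination."""
    return np.exp(-((V - v_stim)**2) / (2 * sigma_v**2)
                  - ((D - d_stim)**2) / (2 * sigma_d**2))

resp = population_response(v_stim=3.0, d_stim=-0.6)
i, j = np.unravel_index(np.argmax(resp), resp.shape)
print("decoded velocity:", V[i, j], " decoded disparity:", D[i, j])   # ~ (3.0, -0.6)
```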
2.9 Interocular time delay and Pulfrich effects
Another interesting application of the integrated motion–stereo model is a unified explanation for a family of Pulfrich-like depth illusions. The classical Pulfrich effect refers to the observation that a pendulum oscillating back and forth in a frontoparallel plane appears to move along an elliptical path in depth when a neutral density filter is placed in front of one eye (Figure 2.10). The direction of apparent rotation is such that the pendulum appears to move away
from the covered eye and towards the uncovered eye.

Figure 2.10 A schematic drawing of the classical Pulfrich effect (top view). A pendulum is oscillating in the frontoparallel plane indicated by the solid line. When a neutral density filter is placed in front of the right eye, the pendulum appears to move along an elliptical path in depth, as indicated by the dashed line. The direction of rotation is such that the pendulum appears to move away from the covered eye and towards the uncovered eye. Reproduced from Figure 1 of Qian and Andersen (1997).

It is known that, by reducing the amount of light reaching the covered retina, the filter introduces a temporal delay in the transmission of visual information from that retina to the cortex (Mansfield and Daugman, 1978; Carney et al., 1989). The traditional explanation of this illusion is that, since the pendulum is moving, when the uncovered eye sees the pendulum at one position, the eye with the filter sees the pendulum at a different position back in time. In other words, the coherent motion of the pendulum converts the interocular time delay into a horizontal disparity at the level of stimuli. However, the Pulfrich depth effect is present even with dynamic noise patterns (Tyler, 1974; Falk, 1980), which lack the coherent motion required for this conversion. Furthermore, the effect is still present when a stroboscopic dot undergoing apparent motion is used such that the two eyes see the dot at exactly the same set of spatial locations but slightly different times (Morgan and Thompson, 1975; Burr and Ross, 1979). Under this condition, the traditional explanation of the Pulfrich effect fails because no conventionally defined spatial disparity exists. It has been suggested that more
than one mechanism may be responsible for these phenomena (Burr and Ross, 1979; Poggio and Poggio, 1984). The stroboscopic version of the Pulfrich effect can occur in the real world when a target moves behind a set of small apertures (Morgan and Thompson, 1975; Burr and Ross, 1979) (Figure 2.7b). Without the occluders, the moving target has a horizontal disparity with respect to the fixation point. With the occluders, the target appears to the two eyes to be at the same aperture locations but at slightly different times. For example, in Figure 2.7b, the target appears at the location of the central aperture at times tL and tR. In this type of situation, the brain has to rely on interocular time delay to infer the depth of the target.

Our mathematical analyses and computer simulations indicate that all three Pulfrich-like phenomena can be explained in a unified way by the integrated motion–stereo model (Qian and Andersen, 1997). Central to the explanation is a mathematical demonstration that a model complex cell with physiologically observed spatiotemporal properties cannot distinguish an interocular time delay t from an equivalent horizontal disparity given by

$$H_{\mathrm{equiv}} = \frac{\omega_{t0}}{\omega_{x0}}\, t, \tag{2.7}$$
where ωt0 and ωx0 are the preferred temporal and horizontal spatial frequencies of the cell. This relation is analogous to Eq. (2.2), where a vertical disparity is treated as an equivalent horizontal disparity by binocular cells. It holds for any arbitrary spatiotemporal pattern (including a coherently moving pendulum, dynamic noise, and stroboscopic stimuli) that can significantly activate the cell. By considering the population responses of a family of cells with a wide range of disparity and motion parameters, all major observations regarding Pulfrich’s pendulum and its generalizations to dynamic noise patterns and stroboscopic stimuli can be explained (Qian and Andersen, 1997). An example of a simulation for a stroboscopic pendulum is shown in Figure 2.11.

Two testable predictions were made based on the analysis (Qian and Andersen, 1997). First, the responses of a binocular complex cell to interocular time delay and binocular disparity should be related according to Eq. (2.7). This prediction was confirmed by later physiological recordings by Anzai et al. (2001), who concluded that “our data provide direct physiological evidence that supports the [Qian and Andersen] model.” The second prediction is also based on Eq. (2.7). The equation predicts that cells with different preferred spatial-to-temporal frequency ratios will individually “report” different apparent Pulfrich depths for a given temporal delay. If we assume that the perceived depth corresponds to the disparities reported by the most responsive cells in a population (or by the population average of all cells weighted by their responses), then the perceived Pulfrich depth should vary according to Eq. (2.7) as we selectively excite different populations of cells by using stimuli with different spatial- and temporal-frequency contents. Psychophysical data are consistent with this prediction (Wist et al., 1977; Morgan and Fahle, 2000).

Figure 2.11 (a) A spatiotemporal representation of a stroboscopic pendulum for one full cycle of oscillation. The two dots in each pair are for the left and the right eye respectively; they are presented at exactly the same spatial location (i.e., the same x) but slightly different times. The time gap between the two sets of dots and the duration of each dot (i.e., the size of a dot along the time axis) are exaggerated in this drawing for the purpose of illustration. (b) The computed equivalent disparity as a function of horizontal position and time. The data points from the simulation are shown as small solid circles. Lines are drawn from the data points to the x–t plane in order to indicate the spatiotemporal location of each data point. The pendulum has negative equivalent disparity (and therefore is seen as closer to the observer) when it is moving to the right, and has positive equivalent disparity (it is seen as further away from the observer) when it is moving to the left. The projection of the 3D plot onto the d–x plane forms a closed path similar to the ellipse in Figure 2.10. The units are arbitrary, measured by the pixel sizes along the space and time dimensions used in the simulation. Reproduced from Figure 4 of Qian and Andersen (1997).

Our Pulfrich model (Qian and Andersen, 1997) has since been known as the joint motion–disparity coding model. Despite its success, the model was
questioned by Read and Cumming (2005a,b), who argued that all Pulfrich effects can be explained by a model that codes motion and disparity separately. Read and Cumming focused on the S-shaped curves of perceived disparity as a function of interocular time delay in the stroboscopic Pulfrich effect (Morgan, 1979). However, we have demonstrated fundamental problems with Read and Cumming’s work in terms of causality, physiological plausibility, and definitions of joint and separate coding, and we have compared the two coding schemes under physiologically plausible assumptions (Qian and Freeman, 2009). We showed that joint coding of disparity and either unidirectional or bidirectional motion selectivity can account for the S curves, but unidirectional selectivity is required to explain direction–depth contingency in Pulfrich effects. In contrast, separate coding can explain neither the S curves nor the direction–depth contingency. We conclude that Pulfrich phenomena can be logically accounted for by joint encoding of unidirectional motion and disparity.
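A simple numerical illustration of Eq. (2.7) for the classical pendulum is given below (a sketch, not the simulation behind Figure 2.11). It assumes that at each moment the most responsive cells are those whose preferred velocity ωt0/ωx0 roughly matches the pendulum's instantaneous velocity, so the equivalent disparity they report is approximately velocity times delay; plotting position against this equivalent disparity then traces a closed, roughly elliptical path whose sign reverses with the direction of motion, as in the classical effect.

```python
import numpy as np

A, f = 10.0, 1.0                      # pendulum amplitude and frequency (arbitrary units)
delay = 0.01                          # interocular delay introduced by the filter (s)
t = np.linspace(0.0, 1.0 / f, 200)    # one full cycle

x = A * np.sin(2 * np.pi * f * t)                    # pendulum position
v = 2 * np.pi * f * A * np.cos(2 * np.pi * f * t)    # pendulum velocity

# Eq. (2.7): assuming the most responsive cells have omega_t0/omega_x0 ~ v,
# they report an equivalent horizontal disparity of roughly v * delay.
h_equiv = v * delay

# The (x, h_equiv) pairs trace a closed elliptical path: the equivalent disparity
# has one sign during rightward motion and the opposite sign during leftward motion.
print("equivalent disparity range:", h_equiv.min(), "to", h_equiv.max())
print("sign for rightward vs. leftward motion:",
      np.sign(h_equiv[v > 0]).mean(), np.sign(h_equiv[v < 0]).mean())
```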
2.10 Concluding remarks
Above, we have reviewed some of our work on physiologically based models of binocular depth perception. Our work was aimed at addressing the limitations of the current experimental and computational methods. Although experimental studies are fundamental to our understanding of visual information processing, these studies do not directly provide algorithms for how a population of cells with known properties may be used to solve a difficult perceptual problem. For example, knowing that there are tuned near- and far-disparity-selective cells in the visual cortex does not tell us how to compute disparity maps from arbitrary stereograms with these cells. Without quantitative modeling, our intuition may often be incomplete or even wrong, and it has only limited power in relating and comprehending a large amount of experimental data.

On the other hand, most computational studies of visual perception have typically been concerned with the ecological or engineering aspects of a task, while giving little or at best secondary consideration to existing physiological data. This tradition appears to stem from David Marr’s overemphasis on separating computational analyses from physiological implementations (Marr, 1982). Although purely computational approaches are highly interesting in their own right, the problem is that without paying close attention to physiology, one often comes up with theories that work in some sense but have little to do with the mechanisms used by the brain. In fact, most computer vision algorithms contain nonphysiological procedures.
In this chapter, we have used examples from binocular depth perception to illustrate that given an appropriate set of experimental data, a physiologically plausible approach to the modeling of neural systems is both feasible and fruitful. The experimental and theoretical studies reviewed here suggest that although the disparity sensitivity in the visual cortex originates from left–right RF shifts in simple cells, it is at the level of complex cells that stimulus disparity is reliably coded in a distributed fashion. These studies suggest further that depth perception from vertical disparity and interocular time delay can be understood through vertical disparity and interocular time delay being treated as equivalent horizontal disparities by visual cortical cells. The models help increase our understanding of visual perception by providing unified accounts for some seemingly different physiological and perceptual observations and suggesting new experiments for further tests of these models. Indeed, without modeling, it would be difficult to infer that random-dot stereograms could be effectively solved by a population of binocular complex cells without resorting to explicit matching, that the psychophysically observed disparity attraction/repulsion phenomenon under different stimulus configurations could be a direct consequence of the underlying binocular RF structure, or that different variations of the Pulfrich depth illusion could all be uniformly explained by the spatiotemporal response properties of binocular cells. Physiology-based computational models have the potential to synthesize a large body of existing experimental data into a coherent framework. They can also make specific, testable predictions and, indeed, several of our key predictions have been confirmed by later experiments, as we have discussed above. Therefore, a close interplay between the experimental and computational approaches holds the best promise for resolving outstanding issues in stereovision (Qian, 1997; Chen et al., 2001), and for achieving a deeper understanding of neural information processing in general.

Acknowledgments

We would like to thank our collaborators Drs. Richard Andersen, Andrew Assee, Yuzhi Chen, Julián Fernández, Ralph Freeman, Nestor Matthews, Xin Meng, Samuel Mikaelian, Brendon Watson, Peng Xu, and Yudong Zhu for their contributions to the work reviewed here. This work was supported by NIH grant #EY016270.

References

Adelson, E. H. and Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A, 2: 284–299.
N. Qian and Y. Li Alonso, J. M. and Martinez, L. M. (1998). Functional connectivity between simple cells and complex cells in cat striate cortex. Nature Neurosci., 1: 395–403. Anzai, A., Ohzawa, I., and Freeman, R. D. (1997). Neural mechanisms underlying binocular fusion and stereopsis: position vs. phase. Proc. Natl. Acad. Sci. USA, 94: 5438–5443. Anzai, A., Ohzawa, I., and Freeman, R. D. (1999a). Neural mechanisms for encoding binocular disparity: receptive field position vs. phase. J. Neurophysiol., 82: 874–890. Anzai, A., Ohzawa, I., and Freeman, R. D. (1999b). Neural mechanisms for processing binocular information: I. Simple cells. J. Neurophysiol., 82: 891–908. Anzai, A., Ohzawa, I., and Freeman, R. D. (1999c). Neural mechanisms for processing binocular information: II. Complex cells. J. Neurophysiol., 82: 909–924. Anzai, A., Ohzawa, I., and Freeman, R. D. (2001). Joint-encoding of motion and depth by visual cortical neurons: neural basis of the Pulfrich effect. Nature Neurosci., 4: 513–518. Arditi, A., Kaufman, L. and Movshon, J. A. (1981). A simple explanation of the induced size effect. Vis. Res., 21: 755–764. Assee, A. and Qian, N. (2007). Solving da Vinci stereopsis with depth-edge-selective v2 cells. Vis. Res., 47: 2585–2602. Backus, B. T., Banks, M. S., van Ee, R., and Crowell, J. A. (1999). Horizontal and vertical disparity, eye position, and stereoscopic slant perception. Vis. Res., 39: 1143–1170. Banks, M. S. and Backus, B. T. (1998). Extra-retinal and perspective cues cause the small range of the induced effect. Vis. Res., 38: 187–194. Bauer, R. and Dow, B. M. (1989). Complementary global maps for orientation coding in upper and lower layers of the monkey’s foveal striate cortex. Exp. Brain Res., 76: 503–509. Bauer, R., Dow, B. M., Synder, A. Z., and Vautin, R. G. (1983). Orientation shift between upper and lower layers in monkey visual cortex. Exp. Brain Res., 50: 133–145. Bishop, P. O. (1996). Stereoscopic depth perception and vertical disparity: neural mechanisms. Vis. Res., 36: 1969–1972. Bishop, P. O. and Pettigrew, J. D. (1986). Neural mechanisms of binocular vision. Vis. Res., 26: 1587–1600. Blake, R. and Wilson, H. R. (1991). Neural models of stereoscopic vision. Trends Neurosci., 14: 445–452. Bradley, D. C., Qian, N., and Andersen, R. A. (1995). Integration of motion and stereopsis in cortical area MT of the macaque. Nature, 373: 609–611. Burr, D. C. and Ross, J. (1979). How does binocular delay give information about depth? Vis. Res., 19: 523–532. Carney, T., Paradiso, M. A., and Freeman, R. D. (1989). A physiological correlate of the Pulfrich effect in cortical neurons of the cat. Vis. Res., 29: 155–165. Chen, Y. and Qian, N. (2004). A coarse-to-fine disparity energy model with both phase-shift and position-shift receptive field mechanisms. Neural Comput., 16: 1545–1577.
Physiologically based models of binocular depth perception Chen, Y., Wang, Y., and Qian, N. (2001). Modeling V1 disparity tuning to time-dependent stimuli. J. Neurophysiol., 86: 143–155. Cumming, B. G. and DeAngelis, G. C. (2001). The physiology of stereopsis. Annu. Rev. Neurosci., 24: 203–238. Cumming, B. G. and Parker, A. J. (1994). Binocular mechanisms for detecting motion-in-depth. Vis. Res., 34: 483–495. Cumming, B. G. and Parker, A. J. (1999). Binocular neurons in V1 of awake monkeys are selective for absolute, not relative, disparity. J. Neurosci., 19: 5602–5618. Cumming, B. G. and Parker, A. J. (2000). Local disparity not perceived depth is signaled by binocular neurons in cortical area V1 of the macaque. J. Neurosci., 20: 4758–4767. Cumming, B. G., Johnston, E. B., and Parker, A. J. (1991). Vertical disparities and perception of three-dimensional shape. Nature, 349: 411–414. Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Am. A, 2: 1160–1169. DeAngelis, G. C., Ohzawa, I., and Freeman, R. D. (1991). Depth is encoded in the visual cortex by a specialized receptive field structure. Nature, 352: 156–159. Durand, J. B., Celebrini, S., and Trotter, Y. (2006). Neural bases of stereopsis across visual field of the alert macaque monkey. Cereb. Cortex, 17: 1260–1273. Falk, D. S. (1980). Dynamic visual noise and the stereophenomenon: interocular time delays, depth and coherent velocities. Percept. Psychophys., 28: 19–27. Farell, B. (1998). Two-dimensional matches from one-dimensional stimulus components in human stereopsis. Nature, 395: 689–693. Felleman, D. J. and Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex, 1: 1–47. Fleet, D. J., Jepson, A. D., and Jenkin, M. (1991). Phase-based disparity measurement. Comput. Vis. Graphics Image Proc., 53: 198–210. Fleet, D. J., Wagner, H., and Heeger, D. J. (1996). Encoding of binocular disparity: energy models, position shifts and phase shifts. Vis. Res., 36: 1839–1858. Freeman, R. D. and Ohzawa, I. (1990). On the neurophysiological organization of binocular vision. Vis. Res., 30: 1661–1676. Gårding, J., Porrill, J., Mayhew, J. E. W., and Frisby, J. P. (1995). Stereopsis, vertical disparity and relief transformations. Vis. Res., 35: 703–722. Gillam, B. and Lawergren, B. (1983). The induced effect, vertical disparity, and stereoscopic theory. Percept. Psychophys., 34: 121–130. Haefner, R. M. and Cumming, B. G. (2008). Adaptation to natural binocular disparities in primate V1 explained by a generalized energy model. Neuron, 57: 147–158. Harris, J. M., McKee, S. P., and Watamaniuk, S. N. J. (1998). Visual search for motion-in-depth: stereomotion does not “pop out” from disparity noise. Nature Neurosci., 1: 165–168. Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Vis. Neurosci., 9: 181–197. Howard, I. P. (2002). Basic Mechanisms. Vol. 1 of Seeing in Depth. Toronto: Porteous.
N. Qian and Y. Li Howard, I. P. and Kaneko, H. (1994). Relative shear disparity and the perception of surface inclination. Vis. Res., 34: 2505–2517. Howard, I. P. and Rogers, B. J. (1995). Binocular Vision and Stereopsis. New York: Oxford University Press. Hubel, D. H. and Wiesel, T. (1962). Receptive fields, binocular interaction, and functional architecture in the cat’s visual cortex. J. Physiol., 160: 106–154. Kaneko, H. and Howard, I. P. (1997). Spatial limitation of vertical-size disparity processing. Vis. Res., 37: 2871–2878. Koenderink, J. J. and van Doorn, A. J. (1976). Geometry of binocular vision and a model for stereopsis. Biol. Cybern., 21: 29–35. Lehky, S. R. and Sejnowski, T. J. (1990). Neural model of stereoacuity and depth interpolation based on a distributed representation of stereo disparity. J. Neurosci., 10: 2281–2299. Leventhal, A. G. (1983). Relationship between preferred orientation and receptive field position of neurons in cat striate cortex. J. Comp. Neurol., 220: 476–483. Liu, L., Stevenson, S. B., and Schor, C. W. (1994). A polar coordinate system for describing binocular disparity. Vis. Res., 34: 1205–1222. Livingstone, M. S. and Tsao, D. Y. (1999). Receptive fields of disparity-selective neurons in macaque striate cortex. Nature Neurosci., 2: 825–832. Mansfield, R. J. W. and Daugman, J. D. (1978). Retinal mechanisms of visual latency. Vis. Res., 18: 1247–1260. Marcˇelja, S. (1980). Mathematical description of the responses of simple cortical cells. J. Opt. Soc. Am. A, 70: 1297–1300. Marr, D. (1982). Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. San Francisco, CA: W. H. Freeman. Marr, D. and Poggio, T. (1976). Cooperative computation of stereo disparity. Science, 194: 283–287. Marr, D. and Poggio, T. (1979). A computational theory of human stereo vision. Proc. R. Soc. Lond. B, 204: 301–328. Maske, R., Yamane, S., and Bishop, P. O. (1984). Binocular simple cells for local stereopsis: comparison of receptive field organizations for the two eyes. Vis. Res., 24: 1921–1929. Matthews, N., Meng, X., Xu, P., and Qian, N. (2003). A physiological theory of depth perception from vertical disparity. Vis. Res., 43: 85–99. Maunsell, J. H. R. and Van Essen, D. C. (1983). Functional properties of neurons in middle temporal visual area of the macaque monkey II. Binocular interactions and sensitivity to binocular disparity. J. Neurophysiol., 49: 1148–1167. Mayhew, J. E. W. (1982). The interpretation of stereo-disparity information: the computation of surface orientation and depth. Perception, 11: 387–403. Mayhew, J. E. W. and Longuet-Higgins, H. C. (1982). A computational model of binocular depth perception. Nature, 297: 376–379. McKee, S. P. and Levi, D. M. (1987). Dichoptic hyperacuity: the precision of nonius alignment. Vis. Res., 4: 1104–1108.
Physiologically based models of binocular depth perception McLean, J. and Palmer, L. A. (1989). Contribution of linear spatiotemporal receptive field structure to velocity selectivity of simple cells in area 17 of cat. Vis. Res., 29: 675–679. Menz, M. D. and Freeman, R. D. (2003). Stereoscopic depth processing in the visual cortex: a coarse-to-fine mechanism. Nature Neurosci., 6: 59–65. Mikaelian, S. and Qian, N. (2000). A physiologically-based explanation of disparity attraction and repulsion. Vis. Res., 40: 2999–3016. Morgan, M. J. (1979). Perception of continuity in stereoscopic motion: a temporal frequency analysis. Vis. Res., 19: 491–500. Morgan, M. J. and Fahle, M. (2000). Motion–stereo mechanisms sensitive to inter-ocular phase. Vis. Res., 40: 1667–1675. Morgan, M. J. and Thompson, P. (1975). Apparent motion and the Pulfrich effect. Perception, 4: 3–18. Nawrot, M. and Blake, R. (1989). Neural integration of information specifying structure from stereopsis and motion. Science, 244: 716–718. Neri, P., Bridge, H., and Heeger, D. J. (2004). Stereoscopic processing of absolute and relative disparity in human visual cortex. J. Neurophysiol., 92: 1880–1891. Nikara, T., Bishop, P. O., and Pettigrew, J. D. (1968). Analysis of retinal correspondence by studying receptive fields of binocular single units in cat striate cortex. Exp. Brain Res., 6: 353–372. Ogle, K. N. (1950). Researches in Binocular Vision. Philadelphia, PA: W. B. Saunders. Ohzawa, I., DeAngelis, G. C., and Freeman, R. D. (1990). Stereoscopic depth discrimination in the visual cortex: neurons ideally suited as disparity detectors. Science, 249: 1037–1041. Ohzawa, I., DeAngelis, G. C., and Freeman, R. D. (1996). Encoding of binocular disparity by simple cells in the cat’s visual cortex. J. Neurophysiol., 75: 1779–1805. Ohzawa, I., DeAngelis, G. C., and Freeman, R. D. (1997). Encoding of binocular disparity by complex cells in the cat’s visual cortex. J. Neurophysiol., 77: 2879–2909. Poggio, G. F. and Fischer, B. (1977). Binocular interaction and depth sensitivity in striate and prestriate cortex of behaving rhesus monkey. J. Neurophysiol., 40: 1392–1405. Poggio, G. F. and Poggio, T. (1984). The analysis of stereopsis. Annu. Rev. Neurosci., 7: 379–412. Poggio, G. F., Motter, B. C., Squatrito, S., and Trotter, Y. (1985). Responses of neurons in visual cortex (V1 and V2) of the alert macaque to dynamic random-dot stereograms. Vis. Res., 25: 397–406. Qian, N. (1994). Computing stereo disparity and motion with known binocular cell properties. Neural Comput., 6: 390–404. Qian, N. (1997). Binocular disparity and the perception of depth. Neuron, 18: 359–368. Qian, N. and Andersen, R. A. (1994). Transparent motion perception as detection of unbalanced motion signals II: physiology. J. Neurosci., 14: 7367–7380.
N. Qian and Y. Li Qian, N. and Andersen, R. A. (1997). A physiological model for motion-stereo integration and a unified explanation of Pulfrich-like phenomena. Vis. Res., 37: 1683–1698. Qian, N. and Freeman, R. D. (2009). Pulfrich phenomena are coded effectively by a joint motion–disparity process. J. Vis., 9: 1–16. Qian, N. and Mikaelian, S. (2000). Relationship between phase and energy methods for disparity computation. Neural Comput., 12: 279–292. Qian, N. and Zhu, Y. (1997). Physiological computation of binocular disparity. Vis. Res., 37: 1811–1827. Qian, N., Andersen, R. A., and Adelson, E. H. (1994a). Transparent motion perception as detection of unbalanced motion signals I: psychophysics. J. Neurosci., 14: 7357–7366. Qian, N., Andersen, R. A. and Adelson, E. H. (1994b). Transparent motion perception as detection of unbalanced motion signals III: modeling. J. Neurosci., 14: 7381–7392. Rashbass, C. and Westheimer, G. (1961). Disjunctive eye movements. J. Physiol., 159: 339–360. Read, J. C. A. and Cumming, B. G. (2005a). All Pulfrich-like illusions can be explained without joint encoding of motion and disparity. J. Vis., 5: 901–927. Read, J. C. A. and Cumming, B. G. (2005b). The stroboscopic Pulfrich effect is not evidence for the joint encoding of motion and depth. J. Vis., 5: 417–434. Read, J. C. A. and Cumming, B. G. (2007). Sensors for impossible stimuli may solve the stereo correspondence problem. Nature Neurosci., 10: 1322–1328. Regan, D. and Beverley, K. I. (1973). Disparity detectors in human depth perception: evidence for directional selectivity. Nature, 181: 877–879. Rogers, B. J. and Bradshaw, M. F. (1993). Vertical disparities, differential perspectives and binocular stereopsis. Nature, 361: 253–255. Rogers, B. J. and Koenderink, J. (1986). Monocular aniseikonia: a motion parallax analogue of the disparity-induced effect. Nature, 322: 62–63. Sanger, T. D. (1988). Stereo disparity computation using Gabor filters. Biol. Cybern., 59: 405–418. Skottun, B. C., DeValois, R. L., Grosof, D. H., Movshon, J. A., Albrecht, D. G., and Bonds, A. B. (1991). Classifying simple and complex cells on the basis of response modulation. Vis. Res., 31: 1079–1086. Smallman, H. S. and MacLeod, D. I. (1994). Size-disparity correlation in stereopsis at contrast threshold. J. Opt. Soc. Am. A, 11: 2169–2183. Smallman, H. S. and McKee, S. P. (1995). A contrast ratio constraint on stereo matching. Proc. R. Soc. Lond. B, 260: 265–271. Snowden, R. J., Treue, S., Erickson, R. E., and Andersen, R. A. (1991). The response of area MT and V1 neurons to transparent motion. J. Neurosci., 11: 2768–2785. Sobel, E. C. and Collett, T. S. (1991). Does vertical disparity scale the perception of stereoscopic depth? Proc. R. Soc. Lond. B, 244: 87–90. Tyler, C. W. (1974). Stereopsis in dynamic visual noise. Nature, 250: 781–782.
Physiologically based models of binocular depth perception Umeda, K., Tanabe, S., and Fujita, I. (2007). Representation of stereoscopic depth based on relative disparity in macaque area V4. J. Neurophysiol., 98: 241–252. Vidyasagar, T. R. and Henry, G. H. (1990). Relationship between preferred orientation and ordinal position in neurons of cat striate cortex. Vis. Neurosci., 5: 565–569. Wagner, H. and Frost, B. (1993). Disparity-sensitive cells in the owl have a characteristic disparity. Nature, 364: 796–798. Watson, A. B. and Ahumada, A. J. (1985). Model of human visual-motion sensing. J. Opt. Soc. Am. A, 2: 322–342. Westheimer, G. (1979). Cooperative neural processes involved in stereoscopic acuity. Exp. Brain Res., 36: 585–597. Westheimer, G. (1984). Sensitivity for vertical retinal image differences. Nature, 307: 632–634. Westheimer, G. (1986). Spatial interaction in the domain of disparity signals in human stereoscopic vision. J. Physiol., 370: 619–629. Westheimer, G. (1990). Detection of disparity motion by the human observer. Optom. Vis. Sci., 67: 627–630. Westheimer, G. and Levi, D. M. (1987). Depth attraction and repulsion of disparate foveal stimuli. Vis. Res., 27: 1361–1368. Westheimer, G. and Pettet, M. W. (1992). Detection and processing of vertical disparity by the human observer. Proc. R. Soc. Lond. B, 250: 243–247. Wist, E. R., Brandt, T., Diener, H. C., and Dichgans, J. (1977). Spatial frequency effect on the Pulfrich stereophenomenon. Vis. Res., 17: 371–397. Zhu, Y. and Qian, N. (1996). Binocular receptive fields, disparity tuning, and characteristic disparity. Neural Comput., 8: 1611–1641.
3 The influence of monocular regions on the binocular perception of spatial layout
Barbara Gillam
Early observations by Leonardo da Vinci (c. 1508) noted that the two eyes can see different parts of the background at the edges of occluding surfaces. This is illustrated in Leonardo's drawing (Figure 3.1) and for two cases in Figure 3.2. In Figure 3.2a, an occluder hides the dotted region of background from both eyes, but there is a region on the right which only the right eye can see and a region on the left which only the left eye can see. Figure 3.2b shows a similar effect of looking through an aperture. In this case, a region on the left of the background seen through the aperture is visible to the right eye and vice versa. It is only since the early 1990s or so that there has been any serious investigation of the perceptual effects of such monocular occlusions, and a whole new set of binocular phenomena, involving the interaction of binocular and monocular elements in determining spatial layout, have been demonstrated and investigated (Harris and Wilcox, 2009). In this chapter, I shall concentrate on four novel phenomena that exemplify different ways in which unpaired regions influence binocular spatial layout.
(1) Da Vinci stereopsis. This will be defined as the perception of monocular targets in depth behind (or camouflaged against) a binocular surface according to constraints such as those shown in Figure 3.2.
(2) Monocular-gap stereopsis. In this case, monocular regions of background influence the perceived depth of binocular surfaces.
(3) Phantom stereopsis. This refers to the perception of an illusory surface in depth “accounting for” monocular regions in a binocular display.
(4) Ambiguous stereopsis. This refers to the interesting situation in which there is not an explicit monocular region, but the visual system treats a disparate contour as either slanted or occluded depending on the context. In the latter case, the extra width in one eye's image is attributed to differential occlusion in the two eyes rather than to slant.
Figure 3.1 Adapted from Leonardo's drawing. Two eyes on the right looking through an aperture to a surface on the left.
Figure 3.2 Different views to different eyes. (a) Two eyes looking at a background behind an occluding surface. The right eye (RE) sees more of the background on the right and the left eye (LE) sees more on the left. (b) Two eyes looking through an aperture at a background surface. The right eye sees more of the background on the left, and vice versa.
A major theme of this chapter is the relationship between monocular-occlusion-based binocular depth perception and regular disparity-defined depth perception. This issue involves three questions:
(1) To what extent is depth caused by monocular occlusion explained directly by disparity-based depth? In other words, regular disparity is present but overlooked. For an example in which this turned out to be the case, see a series of articles on the phantom rectangle by Liu et al. (1994, 1997), Gillam (1995), and Gillam and Nakayama (1999).
(2) Given that regular disparity-based depth is ruled out, to what extent does depth based on monocular occlusion resemble disparity-based depth in its effects? For example, how quantitative and precise is it? How is it constrained?
(3) How does monocular occlusion depth interact with depth based on regular disparity?
3.1 Da Vinci stereopsis
The term “da Vinci stereopsis” was introduced by Nakayama and Shimojo (1990) to describe the depth seen for the monocular bar in the stereogram of Figure 3.3. They found that a monocular bar, for example to the right of the right eye’s image of a binocular surface, appeared behind it to a degree that is predicted by the geometry of a “minimum-depth constraint,” a useful concept they introduced. The minimum-depth constraint is illustrated in Figure 3.4 for a monocular bar on the left of the left eye’s view. To be seen only by the left eye, the bar would have to be behind the binocular surface to a degree determined by its separation from that surface. This represents the minimum
Figure 3.3 Nakayama and Shimojo's stimuli for da Vinci stereopsis. Uncrossed fusion.
Figure 3.4 Minimum-depth constraint for a monocular surface to the left of a binocular surface. The depth of the inner edge, as well as the outer edge of the monocular surface, is constrained.
depth satisfying the geometry, although the depth could be greater and still be within the monocular occlusion zone on that side. Note that the position of the inner edge of the bar is also constrained by the angular separation between it and the binocular surface. It appears, however, that the stimulus used by Nakayama and Shimojo (1990), which consisted of long vertical contours for both the monocular bar and the binocular surface, contains the potential for a form of regular, disparity based stereopsis. This is an instance of Panum’s limiting case (Panum, 1858) in which a single contour in one eye can fuse with several contours in the other eye. Häkkinen and Nyman (1997) and Gillam et al. (2003) provided experimental support for this view, each group showing that a monocular target that was within the monocular constraint zone of a binocular surface but not fusible with its edge failed to produce quantitative depth, although it did look somewhat behind the binocular surface (Figure 3.5). Assee and Qian (2007) have modeled the data of Nakayama and Shimojo (1990) as a form of Panum’s limiting case. Is quantitative depth for a monocular da Vinci target that is not attributable to regular stereopsis possible? For an indication that it is, we must go back to the demonstrations of von Szily (1921), whose work was almost unknown until recently. Figure 3.6 shows a von Szily stereogram. When the right pair is cross-fused or the left pair is fused uncrossed, the monocular tab appears to be
Figure 3.5 Stimulus used by Gillam et al. (2003). The dot appears behind, but the depth is not quantitative.
Figure 3.6 Stimulus from von Szily (1921) with attached monocular tabs. Fuse either the left or the right pair.
behind. When the reverse occurs, the tab appears in front. A strong indication that fusional stereopsis between the edge of the binocular surface and the edge of the tab is not responsible for the perceived depth is that the tab appears in depth but in the frontal plane, whereas fusional depth should cause it to slant. In the von Szily figures, unlike those of Nakayama and Shimojo (1990) and Gillam et al. (2003), the tabs are attached to the binocular surface. A consequence of this is that only the position of the outer side of the tab is subject to the minimumdepth constraint (Figure 3.7). The other (inner) side could be anywhere behind the binocular surface. (As Figure 3.4 shows, a detached bar has two constraints, one for each side.) Note also that in the in-front case the unseen partner of the monocular tab must be treated as camouflaged against the binocular
Figure 3.7 Minimum-depth constraint for a tab attached to a binocular surface. Note that only the left side of the monocular tab is constrained.
surface rather than occluded by it. Nakayama and Shimojo (1990) did not obtain depth in a camouflage configuration, although it was demonstrated by Kaye (1978). Figure 3.8a is a demonstration of a series of attached monocular tabs (extrusions) in near and far da Vinci configurations next to a binocular surface illustrating occlusion (both tab colours) and camouflage (black tabs only). For red tabs, camouflage should not be possible (red cannot be camouflaged against black) and, indeed, the near responses, which depend on camouflage, occur readily with the black tabs but seem to be lacking for the red tabs. Figure 3.8b shows the case of monocular “intrusions.” When the intrusion is on the right side of the binocular surface for the left eye or vice versa, the intrusion stimulus resembles the situation illustrated in Leonardo’s original diagram (Figure 3.1). The binocular surface is seen through an aperture, and the tab is seen behind the aperture on the black surface. As with the extrusions, the far conditions depend on occlusion of one eye’s view of the tab, so its color does not matter, whereas, for the near condition, the tab depth depends on camouflage against the white background and the tab cannot validly be red. As expected, the depth effect seems attenuated in this case. An empirical study of da Vinci depth using the intrusion type of monocular element was carried out by Cook and Gillam (2004). They compared depth for attached tabs and detached bars with the same edge displacement from the binocular surface (Figure 3.9). In each case, observers had to set a binocular
Figure 3.8 Monocular extrusions and intrusions. (a) Monocular extrusions. Cross-fusion of the left pairs causes the tabs to appear behind. Cross-fusion of the right pairs causes the black tab but not the red tab to appear in front. (b) Monocular intrusions. Cross-fusion of the right pairs causes both tabs to appear behind. Cross-fusion of the left pairs causes the white tab but not the red tab to appear in front. A color version of this figure can be found on the publisher’s website (www.cambridge.org/9781107001756).
Figure 3.9 Four of the stereograms used in Cook and Gillam's (2004) experiment using monocular attached tabs (“intrusion tabs”) or detached bars (“intrusion bars”).
probe to the apparent depth of the edge of the bar or tab. Figure 3.10 shows data for seven individual observers. All seven showed a lack of quantitative depth for the bar and a clear quantitative effect, following the minimum-depth constraint, for the equivalent tab. However, whereas five observers (Type 1) were
Figure 3.10 Data for Cook and Gillam’s (2004) experiment. Gray-shaded areas represent monocular occlusion zones and their edges represent minimum-depth constraints. Intruding tabs but not detached bars (see Figure 3.9) follow this constraint.
able to see both a case where the tab appeared nearer than the dumbbell and a case where it was seen through the dumbbell as an aperture, two observers (Type 2) could not see depth in the aperture case (upper left stereogram in Figure 3.9). Cook and Gillam (2004) proposed that the critical factor necessary to obtain quantitative depth in a da Vinci situation is the presence of cyclopean T-junctions, which are present only for the attached tabs. This term refers to T-junctions formed when left- and right-eye stereoscopic views are superimposed. This is illustrated in Figure 3.11. These seem to suppress a fusional depth response. This view is supported by a control experiment made possible by the dumbbell configuration. Cook and Gillam showed that fusional depth does not
Figure 3.11 A fused monocular intrusion figure results in cyclopean T-junctions. (Figure labels: cyclopean view; cyclopean T-junctions; depth determined by maximum disparity; suppression of fusional stereo.)
Figure 3.12 Stimuli for Cook and Gillam's (2004) control experiment, with top, center, and bottom probe positions indicated. (See text.)
occur when cyclopean T-junctions are present but emerges for the same contours when they are absent. The cyclopean T-junctions were eliminated by using only the center of the dumbbell figure (Figure 3.12). Three depth probes were used as shown to measure depth along the vertical edges of the intrusion and
Figure 3.13 Data for Cook and Gillam's (2004) control experiment showing that variation in depth across probes (fusional stereo) occurs for the control condition only. (Panels show the intrusion and control conditions for Type 1 and Type 2 observers; probe disparity (min arc) is plotted for the base, center, and top probe positions, with intrusion angles of 10 and 25 min.)
the control. Disparity-based depth should result in a greater depth for the middle probe than for the upper and lower probes, following variations in disparity. Occlusion-based depth, on the other hand, should result in a constant depth for the three probes. Figure 3.13 shows the results. It is clear that the depth variation is present only for the control figure and not for the full figure with the cyclopean T-junctions.
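Before summarizing, it may help to make the minimum-depth constraint itself concrete. The short sketch below is only an illustration, not material from Cook and Gillam (2004): the function name and the viewing distance and interocular separation are assumptions chosen for this example. It computes the nearest depth behind the binocular surface at which an element seen by one eye alone, at a given angular separation from the occluding edge, would be hidden from the other eye, following the two-line-of-sight construction of Figure 3.4.

import math

def min_depth_behind(separation_arcmin, distance_m, interocular_m=0.065):
    # Angular separation between the monocular element and the occluder edge.
    alpha = math.radians(separation_arcmin / 60.0)
    # Depth at which the seeing eye's line of sight meets the other eye's
    # line of sight through the occluding edge; this reduces to
    # alpha * D**2 / I when alpha * D is small relative to I.
    return alpha * distance_m ** 2 / (interocular_m - alpha * distance_m)

# Assumed example: an element 20 arcmin from the edge, viewed at 0.5 m with a
# 6.5 cm interocular separation (roughly 2.3 cm behind the surface).
print(round(min_depth_behind(20.0, 0.5), 3))

Any depth equal to or greater than this value is consistent with the element's monocularity, which is why the constraint specifies only a minimum.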
To summarize our conclusions concerning da Vinci stereopsis:
(1) It can be quantitative and precise if a monocular element is attached to the binocular surface. If the monocular element is detached, it can appear behind but its depth is nonquantitative.
(2) Cyclopean T-junctions seem to initiate da Vinci type depth for a monocular tab and to inhibit regular disparity-based depth.
(3) To be valid for near depth (camouflage configuration) an extrusion needs to match the luminance and color of the binocular surface and an intrusion needs to match the luminance and color of the background. It appears that near da Vinci depth effects require these matchings (Figure 3.8). This issue needs quantitative investigation, but the demonstrations suggest a new form of binocular matching – of surface properties rather than of contours.
3.2 Monocular-gap stereopsis
Figure 3.14 shows a bird’s eye view of two black rectangles at different depths viewed in front of a white background. The inner ends of the black rectangles are abutting in the left-eye view, whereas the right eye can see between them. A stereogram of this situation is shown beneath the layout represented. This is an interesting situation because only the outer vertical edges in the right eye’s view have matches in the left eye’s view. Based purely on
Figure 3.14 The basic stimulus for Gillam et al.’s (1999) monocular-gap stereopsis.
Figure 3.15 In the right diagram, the dotted line represents a hypothetical division of the solid rectangle in the left eye’s view. Binocular geometry then predicts the perception of frontal-plane rectangles at different depths.
regular stereoscopic principles, one might expect to see a slanted surface with a rivalrous inner whitish patch. However, considering all the information as informative about spatial layout, it becomes clear that a monocular region of background could only arise from two surfaces, separated in signed depth, with the more distant rectangle on the side of the eye with the gap. Indeed, this is what is seen. If the rectangles are assumed to be abutting in the image of the eye without the gap, the gap becomes equivalent to a disparity, and the surfaces should appear to have depth accordingly. The geometry is shown in Figure 3.15. Gillam et al. (1999) measured the depth for this stimulus and found that it could be quantitatively predicted by treating the gap as a disparity. Grove et al. (2002) found that, for the depth to be optimal, the monocular gap had to be of the same color and texture as the background. This is clearly a form of binocular depth due to the application of occlusion geometry. It cannot be attributed to regular disparity-based stereopsis, since there is no disparity at the gap where depth is seen. We asked two questions:
(1) What is the nature of the depth signal? Is it generated at the gap, or is it a depth signal given by the disparity at the edges of the configuration and merely displaced to the gap?
(2) What information is the depth signal based on, and what constraints are applied?
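Before turning to these questions, the gap-as-disparity prediction sketched in Figure 3.15 can be made concrete. The few lines below are an illustration only; the viewing distance and interocular separation are assumed values, not parameters from Gillam et al. (1999). If the rectangles are taken to abut in the eye without the gap, the angular width of the gap plays the role of a relative disparity, and the usual small-angle relation converts it into a predicted depth step between the two surfaces.

import math

def predicted_depth_step(gap_arcmin, distance_m, interocular_m=0.065):
    # Treat the monocular gap as if it were a horizontal disparity and apply
    # the small-angle disparity-to-depth relation: depth ~ gap * D**2 / I.
    gap_rad = math.radians(gap_arcmin / 60.0)
    return gap_rad * distance_m ** 2 / interocular_m

# Assumed example: gap widths in the range plotted on the abscissa of
# Figure 3.20, viewed at 1 m.
for gap_arcmin in (10.0, 20.0, 30.0):
    print(gap_arcmin, "arcmin ->",
          round(100 * predicted_depth_step(gap_arcmin, 1.0), 1), "cm")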
3.2.1 Nature of depth signal
Pianta and Gillam (2003a) investigated the nature of the depth signal in two experiments using the stimuli shown in Figure 3.16. In the first experiment, we compared depth discrimination thresholds as a function of exposure duration for a monocular-gap stimulus (Figure 3.16b), an equivalent binocular-gap stimulus with a disparity equal to the gap (Figure 3.16a), and a third stimulus (Figure 3.16c) with no gaps but with an edge disparity equal to that of the other two. The results are shown in Figure 3.17. The depth thresholds for monoculargap and binocular-gap stimuli were very similar for all three observers, whereas the threshold for detecting the edge disparity (slant) was greatly raised, despite it sharing the (only) disparity present in the monocular-gap stimulus (at the edges). This finding refutes the idea that the depth at the monocular gap is merely the depth signal at the edges displaced to the gap. This idea is embodied in the Grossberg and Howe (2003) model of this phenomenon. We conclude that the presence of a monocular gap not only locates depth away from the location of the disparity but also greatly improves the depth signal. Similar mechanisms appear to mediate depth discrimination for monocular and normal stereo stimuli in this case. a
Figure 3.16 Stimuli used in the experiments of Pianta and Gillam (2003a): (a) stereo gap; (b) monocular-gap stereopsis; (c) stereoscopic slant. In all cases, the edge disparity equals b.
Figure 3.17 Thresholds for detecting depth for the three stimulus types shown in Figure 3.16 (monocular gap, stereo gap, and slant), plotted as log threshold depth (arcmin) against duration (ms). (Three observers: MC, RGB, MJP.)
To test this last assertion, our second experiment compared depth aftereffect transfer to the regular stereo stimulus shown in Figure 3.16a using all three stimuli shown in Figure 3.16 as inducing stimuli. We used a Bayesian method with multiple staircases (Kontsevich and Tyler, 1999) to track the adaptation over time. The results are shown in Figure 3.18. The aftereffect was identical for monocular-gap and stereo-gap stimuli, but there was no aftereffect for the stimulus with edge disparity only. Adaptation therefore cannot be attributed to the disparity per se. It appears to affect a depth signal that is common to unpaired and normal stereo stimuli but not the edge-based (slant) stimulus, despite its common edge disparity with the monocular-gap stimulus. 3.2.2
What constraints are used?
The monocular gap constrains the response to include a signed depth step at the gap. The edge disparity constrains the depth at the outer edges of the figure. However, any monocular-gap stimulus is ambiguous. In Figure 3.15, for example, the rectangles seen in depth could be slanted or even differently
Figure 3.18 Decay of aftereffects for the three stimuli shown in Figure 3.16 (monocular gap, stereo gap, and slant), plotted as PSE (arcmin) against time (s) for three observers (MC, RGB, MJP). The two gap stimuli (monocular and stereo) gave very similar results, while the slant stimulus had no aftereffect (Pianta and Gillam, 2003a).
slanted and not violate the geometry but the depth accompanying a slanted solution would have to be larger than that predicted by the frontal plane solution. The finding that the frontal plane solution rather than one of the possible slanted solutions could be due to either a minimum-depth constraint or a minimum-slant constraint. In our next experiment (Pianta and Gillam, 2003b), we examined this question, exploring the roles of edge disparity and gap size independently in determining depth and attempting to determine what constraints are imposed. We used three types of stimuli, shown at the bottom of Figure 3.19. They varied in the width of the solid rectangle. Figure 3.19a shows the monocular-gap stimulus used in our previous experiments, in which the gap width and edge disparity were the same. The stimulus shown in Figure 3.19b has no edge disparity. The only depth resolutions compatible with the geometry for this stimulus are slanted. This stimulus has no disparity anywhere. The final stimulus (Figure 3.19c) was critical in revealing constraints. It had an edge disparity twice as great as the gap for each gap size. This allowed separation of the predictions of the minimum-depth and minimum-slant constraints. The minimum-depth constraint would predict slanted rectangles with the same depth as in Figure 3.19a, whereas the minimum-slant constraint would predict frontal-plane rectangles with twice the depth of that in Figure 3.19a. Both of these solutions are compatible with the geometry. The results are shown in Figure 3.20, where parts (a),
Figure 3.19 The types of stimuli used in the depth-matching experiment of Pianta and Gillam (2003b). The width of the solid rectangle is given with each pair of stimuli ((a) 2a, (b) 2a+b, (c) 2a−b); a refers to the width of each rectangle in the eye with a gap, and b refers to the width of the gap. Possible resolutions are shown by the dotted lines in each case.
(b), and (c) correspond to the three stimuli shown in Figure 3.19. Figure 3.20a confirmed our previous data, with quantitative depth predicted by both the minimum depth and the minimum slant constraints. Figure 3.20b, despite the complete absence of disparity, showed increasing depth as a function of the gap but for two observers the depth was less than that predicted by the minimumdepth constraint. For Figure 3.20c, the depth clearly followed the minimumslant constraint in that the depth was twice as great as the prediction of the minimum-depth constraint (which would have required a slanted solution). It is interesting that the minimum-slant solution is favored even though it implies overlapping rather than abutting images in the eye with the solid rectangle. This experiment shows that depth is always seen when there is a monocular gap. When the geometry would require a slanted solution, however, the geometry is not fully realized perceptually. This could be the result of conflicting perspective. When the edge disparity is greater than the gap width, a frontal-plane solution is always possible and is strongly preferred over a slanted
Figure 3.20 Results of the Pianta and Gillam (2003b) experiment measuring perceived depth in monocular-gap stereograms, where the width of the solid rectangle was varied: (a) edge disparity equals gap (solid width 2a); (b) no edge disparity (solid width 2a+b); (c) edge disparity twice the gap (solid width 2a−b). Matched depth (arcmin) is plotted against gap width b (arcmin). (See text and Figure 3.19 for details.)
solution with a smaller depth. We conclude that a minimum-slant constraint is applied where possible and that gap size and edge disparity jointly determine the highly metric depth seen at the gap.
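The logic of the comparison can be restated compactly in the equivalent-disparity units of Figure 3.20. The lines below are only a schematic restatement of the argument above, not a reanalysis of the data, and the function name is mine: the minimum-depth constraint takes the gap itself as the depth-defining quantity (at the price of a slanted solution), whereas the minimum-slant constraint takes the edge disparity (allowing frontal-plane rectangles), so the two predictions coincide when the edge disparity equals the gap and differ by a factor of two when the edge disparity is twice the gap.

def constraint_predictions(gap_arcmin, edge_disparity_arcmin):
    # Predicted depth at the gap, expressed as an equivalent disparity (arcmin).
    minimum_depth_prediction = gap_arcmin             # slanted rectangles
    minimum_slant_prediction = edge_disparity_arcmin  # frontal-plane rectangles
    return minimum_depth_prediction, minimum_slant_prediction

# Stimulus (a): edge disparity equal to the gap; stimulus (c): edge disparity
# twice the gap, the case that separates the two constraints.
for gap in (10.0, 20.0, 30.0):
    print(gap, constraint_predictions(gap, gap), constraint_predictions(gap, 2.0 * gap))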
3.3 Phantom stereopsis
This phenomenon was discovered by Nakayama and Shimojo (1990). When a few monocular dots were placed next to a set of sparsely spaced binocular dots on the valid side for a monocular occlusion zone, a ghostly contour was seen at the edge of the binocular dots, apparently “accounting for” the monocularity of the extra dots. Anderson (1994) showed that vertical lines that are different in length in the two eyes produce oblique phantom contours that account for the vertical difference. Gillam and Nakayama (1999) devised a very simple stimulus completely devoid of binocular disparity, consisting of a pair of identical lines in the two eyes but with a middle section of the left line missing for the right eye and a middle section of the right line missing for the left eye. This gives rise to a phantom rectangle in front of the middle section of the lines,
Figure 3.21 (a) Rectangle in depth hiding the center of the left line for the right eye, and vice versa. (b) Geometry of minimum-depth constraint for this stimulus.
accounting for the missing middle sections in each eye as shown in Figure 3.21a. Figure 3.21b shows the geometry of the minimum-depth constraint for this stimulus. The wider the lines, the greater the depth of the rectangle would have to be to hide each of the middle sections of the lines from one eye only. Figure 3.22 is a stereogram demonstrating the phantom rectangle for uncrossed and crossed fusion. Gillam and Nakayama (1999) showed that the depth of the rectangle is quantitatively related to the width of the lines, but they and several other investigators (Grove et al., 2002; Mitsudo et al., 2005, 2006) found that the perceived depth is greater than the minimum-depth constraint. Furthermore, Mitsudo et al. (2005, 2006) found that the phantom rectangle but not its inverse (the same stimulus with eyes switched) is more detectable in disparity noise, has a lower contrast detection threshold, and supports better parallel visual-search performance than does an equivalent disparity-defined stimulus. They attributed these results to the greater apparent depth of this stimulus and argued that a long-range surface process based on unpaired regions is resolved
Figure 3.22 Left and right images for the situation shown in Figure 3.21. For crossed fusion, use the lower images.
Figure 3.23 The regular dotted texture in the background is remapped to show capture by the apparent depth of the phantom rectangle. Left pair for crossed and right pair for uncrossed fusion. Reprinted from Häkkinen and Nyman (2001).
early in the visual system. The phantom rectangle behaves like disparity-defined stereopsis in several other ways. Häkkinen and Nyman (2001) showed that it supports visual capture (Figure 3.23). It also shows scaling with changes in vergence that are very similar to those found with an equivalent disparity-defined rectangle (Kuroki and Nakamizo, 2006). These many similarities with disparity-based stereopsis for a stimulus that has no disparity anywhere are particularly challenging to the view that depth based on monocular regions is a distinct process from disparity-defined depth.
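The Gillam and Nakayama (1999) stimulus is simple enough to be generated schematically in a few lines, which may help to make the description above concrete. The sketch below is an illustration only: the image size, line positions, and the height of the deleted sections are arbitrary choices of mine rather than the published parameters. Each eye receives the same two vertical lines, but the middle section of the left line is deleted in the right eye's image and the middle section of the right line is deleted in the left eye's image.

import numpy as np

def phantom_rectangle_pair(size=200, line_width=6, deleted_height=60):
    # White background (1.0) with two black vertical lines (0.0) in each eye.
    left_eye = np.ones((size, size))
    right_eye = np.ones((size, size))
    x_left, x_right = size // 3, 2 * size // 3
    for image in (left_eye, right_eye):
        image[:, x_left:x_left + line_width] = 0.0
        image[:, x_right:x_right + line_width] = 0.0
    # Remove the middle section of the left line for the right eye and of the
    # right line for the left eye.
    top = (size - deleted_height) // 2
    right_eye[top:top + deleted_height, x_left:x_left + line_width] = 1.0
    left_eye[top:top + deleted_height, x_right:x_right + line_width] = 1.0
    return left_eye, right_eye

left_image, right_image = phantom_rectangle_pair()

Arranged for free fusion as in Figure 3.22, such a pair contains no disparity anywhere, yet the deleted sections are “accounted for” by a phantom surface whose apparent depth grows with the line width.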
3.4 Ambiguous stereopsis
Unlike the phantom-stereopsis stimulus, which has no disparity at all, the final case to be considered has horizontal disparity that can be shown to support regular stereopsis perfectly well. Because of the context, however, the visual system prefers instead to attribute the horizontal difference or differences in the image not to slant but to differential occlusion in the two eyes. Figure 3.24 shows how the same horizontal disparity could be produced by a slant or an occlusion. Häkkinen and Nyman (1997) were the first to investigate responses to this ambiguity, showing that when a taller binocular surface was placed next to a disparate rectangle in a valid position for its partial occlusion, the perceived slant of the rectangle was significantly attenuated. Gillam and Grove (2004) used sets of horizontal lines aligned on one side (Figure 3.25.) The lines were made longer in one eye’s view either by horizontally magnifying the set of lines in that eye (not shown) or by adding a constant length to that set (middle set in Figure 3.25). In the former case, uniform slant was seen, consistent with the uniform magnification of the lines in one eye’s view. In the latter case, a phantom occluder appeared in depth, accounting for the extra constant length in one eye’s view. This only occurred, however, when the longer lines were on the valid side (right in the right eye or left in the left eye) as in the left pair shown in Figure 3.25 viewed with crossed fusion. If the views were switched between eyes, the lines all appeared to have different slants, since an extra constant length magnified each line differently in the two
Figure 3.24 The same horizontal disparity produced by a slant and an occlusion.
Figure 3.25 Cross-fuse left pair for valid occlusion with a phantom occluder, and cross-fuse right pair for invalid occlusion. Vice versa for uncrossed fusion.
eyes. Importantly, the latter observation shows that disparity-based stereo was available as a response in either the valid or the invalid case. In the invalid case, seeing local stereo slants was the only ecologically valid response. In the valid case, however, the global resolution of seeing an occluder represented a single solution for all lines and was preferred to a series of different local resolutions consistent with conventional stereopsis. This indicates that global considerations are powerful in stereopsis and that conventional disparity-based stereopsis is not necessarily the paramount depth response. Figure 3.26 shows some even more remarkable examples of perceived phantom occlusion when conventional stereopsis was available on a local basis (as shown by the response to the invalid cases). In Figure 3.26a, crossed fusion of the left pair demonstrates a phantom occluder sloping in depth. Crossed fusion of the right pair (invalid for occlusion) shows lines at different local slants. Figure 3.26b (left pair, crossed fusion) shows a smoothly curving occluder in depth. The invalid case again shows local slants.
Figure 3.26 In (a), a diagonal occluder sloping in depth is perceived. In (b), the occluder is perceived to curve in depth. Cross-fuse left pair in each case for valid occlusion, and cross-fuse right pair for the invalid case. (See text for details.)
3.5 Conclusions
Consideration of the binocular layout of overlapping surfaces and of the variety of ways monocular regions contribute to perceiving this layout seems to require a new approach to binocular vision. Certainly, a division of binocular spatial-layout perception into regular stereopsis based on the disparity of matched images on the one hand and depth from unpaired (monocular) regions on the other is not a tenable position, for the following reasons:
(1) Global context may cause a given horizontal disparity to be treated as a uniocular occlusion, with this response replacing the normal stereo response to horizontal disparity. In the von Szily stereograms (von Szily, 1921), the Cook and Gillam (2004) intrusion stereograms, and the occlusion stimulus of Häkkinen and Nyman (1997), the context favoring an occlusion interpretation seems to consist of cyclopean T-junctions. In ambiguous stereopsis (Gillam and Grove, 2004), the context consists of multiple disparate lines, all consistent with a single global occlusion resolution. When the eyes are switched, making the occlusion interpretation invalid, the same disparities are resolved locally according to regular stereoscopic principles.
(2) Disparity-based stereopsis can cooperate with monocular details to determine depth magnitude (monocular-gap stereopsis). This cooperation results in highly metric depth (at a location without matched features), with detection thresholds similar to those for fully disparity-defined depth, and cross-adaptation with it. The presence of a monocular gap is critical to obtaining these depth effects.
(3) The phantom rectangle, a binocular/monocular stimulus which has no disparity at all, elicits depth that closely resembles that of disparity-based stereopsis, facilitating search and showing the same depth scaling and capture.
The various forms of binocular depth perception we have considered vary from one another in several ways. They vary in the constraints present or imposed in each case. These constraints are not always understood. What is the basis of the imposition of the minimum-depth constraint or the minimumslant constraint? Why, for example, does the phantom rectangle have a greater depth than the minimum-depth constraint? Why do separated bars in monocular occlusion zones not follow the constraint? We have argued that this latter finding is related to a lack of contextual support from cyclopean T-junctions. Such support can also come from other monocular regions supporting the same occluding surface (the phantom rectangle) or from multiple disparities all consistent with the same uniocular occlusion (ambiguous stereopsis). Support can also come from binocular disparity elsewhere in the image (monocular-gap stereopsis). Despite their differences, we regard all the phenomena we have considered here to be aspects of a complex binocular surface recovery process. We would argue further that disparity-based stereopsis is not qualitatively different but part of the same process, differing in being more highly constrained and less in need of contextual support. The processes underlying the surprising range of binocular information we can respond to constitute fertile ground for further research, modeling, and physiological exploration.
References
Anderson, B. L. (1994). The role of partial occlusion in stereopsis. Nature, 367: 365–368.
Assee, A. and Qian, N. (2007). Solving da Vinci stereopsis with depth-edge-selective V2 cells. Vis. Res., 47: 2585–2602.
Cook, M. and Gillam, B. (2004). Depth of monocular elements in a binocular scene: the conditions for da Vinci stereopsis. J. Exp. Psychol.: Hum. Percept. Perf., 30: 92–103.
da Vinci, Leonardo (c. 1508). Manuscript D. Bibliotheque Institut de France. Figure reproduced in Strong, D. (1979). Leonardo on the Eye. New York: Garland.
Gillam, B. (1995). Matching needed for stereopsis. Nature, 373: 202–203.
Gillam, B. and Grove, P. M. (2004). Slant or occlusion: global factors resolve stereoscopic ambiguity in sets of horizontal lines. Vis. Res., 44: 2359–2366.
Gillam, B. and Nakayama, K. (1999). Quantitative depth for a phantom surface can be based on occlusion cues alone. Vis. Res., 39: 109–112.
Gillam, B., Blackburn, S., and Nakayama, K. (1999). Stereopsis based on monocular gaps: metrical encoding of depth and slant without matching contours. Vis. Res., 39: 493–502.
Gillam, B., Cook, M., and Blackburn, S. (2003). Monocular discs in the occlusion zones of binocular surfaces do not have quantitative depth – a comparison with Panum's limiting case. Perception, 32: 1009–1019.
Grove, P. M., Gillam, B., and Ono, H. (2002). Content and context of monocular regions determine perceived depth in random dot, unpaired background and phantom stereograms. Vis. Res., 42: 1859–1870.
Grossberg, S. and Howe, P. D. L. (2003). A laminar cortical model of stereopsis and three-dimensional surface perception. Vis. Res., 43: 801–829.
Häkkinen, J. and Nyman, G. (1997). Occlusion constraints and stereoscopic slant. Perception, 26: 29–38.
Häkkinen, J. and Nyman, G. (2001). Phantom surface captures stereopsis. Vis. Res., 41: 187–199.
Harris, J. M. and Wilcox, L. M. (2009). The role of monocularly visible regions in binocular scenes. Vis. Res., 49: 2666–2685.
Kaye, M. (1978). Stereopsis without binocular correlation. Vis. Res., 18: 1013–1022.
Kontsevich, L. L. and Tyler, C. W. (1999). Bayesian adaptive estimation of psychometric slope and threshold. Vis. Res., 39: 2729–2737.
Kuroki, D. and Nakamizo, S. (2006). Depth scaling in phantom and monocular gap stereograms using absolute distance information. Vis. Res., 46: 4206–4216.
Liu, L., Stevenson, S. B., and Schor, C. M. (1994). Quantitative stereoscopic depth without binocular correspondence. Nature, 367: 66–69.
Liu, L., Stevenson, S. B., and Schor, C. M. (1997). Binocular matching of dissimilar features in phantom stereopsis. Vis. Res., 37: 633–644.
Mitsudo, H., Nakamizo, S., and Ono, H. (2005). Greater depth seen with phantom stereopsis is coded at the early stages of visual processing. Vis. Res., 45: 1365–1374.
Mitsudo, H., Nakamizo, S., and Ono, H. (2006). A long-distance stereoscopic detector for partially occluding surfaces. Vis. Res., 46: 1180–1186.
Nakayama, K. and Shimojo, S. (1990). Da Vinci stereopsis: depth and subjective occluding contours from unpaired image points. Vis. Res., 30: 1811–1825.
Panum, P. L. (1858). Physiologische Untersuchungen über das Sehen mit zwei Augen. Kiel: Schwerssche Buchhandlungen.
Pianta, M. J. and Gillam, B. J. (2003a). Monocular gap stereopsis: manipulation of the outer edge disparity and the shape of the gap. Vis. Res., 43: 1937–1950.
Pianta, M. J. and Gillam, B. (2003b). Paired and unpaired features can be equally effective in human depth perception. Vis. Res., 43: 1–6.
Szily, A. von (1921). Stereoskopische Versuche mit Schattenrissen. Von Graefes Arch. Ophthalmol., 105: 964–972. (W. Ehrenstein and B. J. Gillam, trans., Stereoscopic experiments with silhouettes. Perception, 1998, 27: 1407–1416.)
4 Information, illusion, and constancy in telestereoscopic viewing
Brian Rogers
4.1 The concept of illusion
When is an illusion not an illusion? If the definition of an illusion is something like “a lack of correspondence between the ‘input’ and what we perceive,” then it necessarily depends on how we define the “input.” Clearly, this shouldn’t refer to the characteristics of the proximal image, because if it did, we would have to treat all perceptual constancies as illusions since there is invariably a lack of correspondence between the retinal size, shape, or wavelength and what we see. Helmholtz (1910) was well aware of this when he wrote: “I am myself disposed to think that neither the size, form and position of the real retina nor the distortions of the image projected on it matter at all, as long as the image is sharply defined all over.” If the “input” is not to be defined in terms of the proximal image, then perhaps it should be defined in terms of the properties of the distal image – the real-world situation that creates a particular retinal input. This sounds better, but a problem arises when the patterns of light reaching the eyes have not been created by the real world but instead by some device that precisely mimics those patterns of light. For example, our perception of structure-from-motion that has been created by patterns of motion on a flat computer screen (a kinetic depth effect, or KDE) would have to regarded as an illusion, as would the perception of depth created by a pair of 2D images in a stereoscope. If the patterns of light reaching the eye(s) are “projectively equivalent” (Howard and Rogers, 2002) over space and time to those created by a particular real-world scenario, then no seeing machine, artificial or biological,
Information, illusion, and constancy in telestereoscopic viewing could ever tell the difference, and hence it makes no sense to call our perception “illusory.” We can call all these equivalent stimulus situations “facsimiles,” but it doesn’t seem appropriate to call what we perceive under these circumstances “illusory” (Rogers, 2010). In other words, the reference from which we assess what is illusory and what is not (Wade, 2005) cannot depend on how the particular pattern of light reaching the eye(s) is created; instead, it has to depend on the informational content of that pattern of light. If viewing a structure-from-motion display or a pair of stereoscopic images should not be regarded as illusory, what about situations such as the Ames room? What we see – a normal rectangular room – does not correspond to the physical structure of the room, which is trapezoidal, and therefore it sounds as if our perception ought to be regarded as illusory. This was the view expressed by Glennerster et al. (2006). They wrote: “In that illusion, the two sides of the room have different scales but appear similar because observers fail to notice the gradual change in scale across the spatial extent of the room.” However, if the Ames room is built with care, there is no information about the “change in scale across the extent of the room.”1 On the contrary, there is abundant information from the perspective features drawn on the walls to tell us that the room is rectangular, and what we perceive is consistent with this information. As a consequence, it is not clear why our perception of the Ames room should be regarded as illusory (Rogers, 2004). One possibility is that the situation becomes illusory when we add people to the Ames room, because we see the people standing in the two corners as being different in size (Figure 4.1). In these circumstances, there is a conflict between the perspective information telling us that the room is rectangular and the familiar-size-of-object information suggesting that the people must be at different distances. It is important to note, however, that the familiar-sizeof-object cues do not provide any contradictory information about the room’s 3D shape; they tell us only that the distances to the two people are not the same. These situations, where there is inconsistent or conflicting information, provide us with a useful way of finding out the relative strength or importance of different sources of information (Bülthoff and Mallot, 1987; Maloney and Landy, 1989), but it is not clear that what we perceive under these circumstances, whether it be a weighted average, a “winner takes all,” or any other outcome, should be regarded as “illusory.” An alternative definition of what constitutes an illusion – which would be something like “a lack of correspondence between the available information 1
Ignoring, for the moment, any small differences in the accommodation demands of the near and far corners of the room.
Figure 4.1 In the Ames room, the actual shape of the room is trapezoidal, with the left corner being farther away than the right, but the room is painted with windows and doors and other features to provide perspective information that is consistent with a normal rectangular room.
and what we see” – is not without its own problems, however. Information is not a neutral concept but has to be defined with respect to a particular organism and the characteristics of its particular sensory systems (Rogers, 2010). For example, what is regarded as information about the color of a surface depends on whether the observer is trichromatic or dichromatic and on all the complexities of the spatial comparisons that underlie human color vision.
4.2 The telestereoscope
With these cautionary comments in mind, how should we understand the perception of the world viewed through a telestereoscope? Helmholtz was one of the first to build a telestereoscope, which is a device for increasing (or decreasing) the effective interocular separation of the two eyes (Helmholtz, 1910; Howard and Rogers, 1995). Typically, it consists of two pairs of parallel mirrors, oriented at angles of ±45◦ to the line of sight. The mirrors have the effect of increasing the effective distance between the viewing positions (vantage points) of the two eyes (Figure 4.2). Note that keeping the pairs of mirrors parallel to each other ensures that the visual axes remain parallel when the observer is viewing objects at infinity, as depicted in Figure 4.2. However,
Figure 4.2 A drawing of a telestereoscope from Helmholtz’s Treatise on Physiological Optics of 1867.
Figure 4.3 (a) Increasing the effective interocular distance with a telestereoscope increases the vergence demand of any object in the scene. (b) Doubling the effective interocular distance is equivalent to halving the scale of the entire visual scene and viewing with a normal interocular distance.
because the distance between the effective viewing positions of the two eyes is increased, the vergence requirement for any closer object in the world is increased (Judge and Miles, 1985; Judge and Bradford, 1988) (Figure 4.3a). In fact, the vergence requirement is increased by the same factor as the increase
B. Rogers in the effective separation of the viewing positions, as Helmholtz himself noted (Helmholtz, 1910, p. 312). Apart from an increased vergence demand for closer objects, what are the other consequences of telestereoscopic viewing? It is straightforward to show mathematically that doubling the effective interocular distance with telestereoscopic viewing has the consequence that the binocular disparities created by a given object will be increased by a factor of two. However, a more straightforward way to think about the effect of telestereoscopic viewing is that the scale of the entire visual scene is changed by the same ratio as the change in the interocular distance – doubling the interocular distance is equivalent to halving the depicted dimensions of the entire scene (Figure 4.3b). Note that a halving of all the depicted dimensions has the effect of preserving all the information about the 3D layout and shape of objects within the scene, as well as the ratios of all sizes, depths, and distances. To summarize, increasing the effective interocular separation using a telestereoscope produces (i) the same patterns of light to the two eyes (optic arrays) and (ii) the same vergence requirements as would be created by viewing a miniature (doll’s house) version of the scene. Like the Ames room, telestereoscopic viewing provides perspective information about the 3D shape and layout of the scene, but it is different in the sense that there is vergence and verticaldisparity information about absolute distance, which is absent in the Ames room. Moreover, just as no (monocular) seeing machine could ever distinguish between an Ames room and a real rectangular room, so no binocular seeing machine could distinguish between a telestereoscopically viewed scene and a miniature version of the same scene.2 There is yet another similarity between telestereoscopic viewing and the Ames room. Both situations provide conflicting information when there are objects of a known size in the scene. In the Ames room, the presence of two people of known size signals that those people are at different distances, and this is inconsistent with perspective features of the room. In telestereoscopic viewing, the presence of objects of known size signals that all those objects are at their actual distances, while the vergence demand and vertical-disparity cues signal a scaled-down, miniature scene. Hence there is no reason to suppose that observers will perceive anything other than a perfect miniature scene under telestereoscopic viewing when no familiar objects are present, because there is no information to specify anything else.3 It is an open question, however, as to what observers will perceive if the scene does contain 2
3
Once again, ignoring any differences in the accommodation demands of the miniature scene. This is analogous to the Ames room without the people.
Information, illusion, and constancy in telestereoscopic viewing familiar objects, because, under these circumstances, the observer is presented with conflicting information. Helmholtz (1910) was in no doubt about what he saw under telestereoscopic viewing. He wrote the following: “If the mirrors are adjusted so that infinitely distant objects are seen through the instrument with parallel visual axes, it will seem as if the observer were looking not at the natural landscape itself, but at a very exquisite and exact model of it, reduced in scale in the ratio of the distance between r1 and ρ1 to that between r and ρ” (my italics). Although quite believable, it could be argued that Helmholtz’s statement is more of a subjective impression than a conclusion based on the results of systematic psychophysical measurement. Given this fact, it is surprising that there have been so few empirical studies to investigate human observers’ perception of the world under telestereoscopic viewing conditions in a more systematic way. Let us first consider in more detail the reasons why observers might not see “a very exquisite and exact model” under telestereoscopic viewing. The first reason (mentioned already) is that if the scene contains objects of a known size, this might bias our perception away from the information provided by the vergence demand and vertical disparities and towards that of the actual scene. But there is a second reason. The perception of both the size and the depth of objects depends on the correct scaling of retinal sizes and binocular disparities. Both size and disparity scaling require estimates of the absolute distance, and theoretical considerations show that these estimates can be derived from either the vergence angle of the eyes or the gradient of vertical disparities (Gillam and Lawergren, 1983; Rogers and Bradshaw, 1993). Under telestereoscopic viewing, the depicted distance of all objects is altered with respect to both the vergence demand of the eyes and the gradients of vertical disparities. (In passing, it is interesting to note that there has been almost no mention in the previous literature of the possible effects of the altered vertical-disparity gradients in telestereoscopic viewing.) Hence, unless size and disparity scaling is perfect (100%) for the scale changes created by telestereoscopic viewing, we can make a strong prediction that there will be errors or distortions in our perception of the scaled-down scene, even if the scene contains no familiar-sized objects. 4.3
Size and disparity scaling
Of the two possible reasons why observers might not see a telestereoscopically viewed scene in exactly the way predicted by the geometry, we know more about size and disparity scaling than we do about the role of familiar-sized objects. We know, for example, that the 3D shape constancy for
B. Rogers a disparity-specified surface can be as low as 25%4 in a situation in which the only source of information about absolute distance is the vergence demand of the surface (Johnston, 1991). Bradshaw et al. (1996) found that the constancy was higher, at around 40%, using a large-visual-field display where both vergence and vertical-disparity cues were present. When 3D shape constancy was measured in a well-lit, full-cue situation that included appropriate vergence, vertical-disparity, and accommodation information, the constancy was as high as 75% over a range of distances between 38 and 228 cm (Glennerster et al., 1996). It is important to remember that the results of all three studies were based on judgments about the shape of a single object or surface. When observers were asked to match the amount of perceived depth in a pair of surfaces at different distances – a relative rather than an absolute judgment – the constancy was found to be close to 100% (Glennerster et al., 1996). The authors of that publication argued that the improved constancy in the latter task was because the task requires only an estimate of the ratio of the distances to the two surfaces, rather than the absolute distances. It is also important to note what is meant by “less than 100% constancy.” In the Glennerster et al. (1996) experiment, as well as in many other previous studies, depth and 3D shape were typically overestimated (enhanced) for surfaces at close distances (< 70 cm) and underestimated (flattened) for surfaces at far distances (> 100 cm). The 3D shape was perceived more or less veridically at some intermediate, or abathic, distance (Gogel, 1977; Foley, 1980). These results are consistent with the idea that the absolute distance used in constancy scaling is overestimated for close surfaces and underestimated for far surfaces (Johnston, 1991). What implications do these findings have for telestereoscopic viewing? As we pointed out earlier, increasing the interocular distance by a factor of two leaves retinal sizes unaffected, while the binocular disparities of objects in the scene are increased by a factor of two. In addition, the vergence demand of any object in the scene is doubled,5 and hence the absolute distance signaled by the increased vergence is halved. Moreover, the same relationship holds for the absolute distance specified by the gradient of vertical disparities for any particular surface (Rogers and Bradshaw, 1993). As a consequence, we should predict that if the scaling of both retinal size and disparity were 100%, the perceived scene should be seen as a perfectly scaled-down miniature version of the original. Since previous results show that both perceived size and perceived depth are underestimated at large viewing distances, we might predict that perceived size and depth should be more veridical under telestereoscopic viewing conditions 4
5
25% constancy means that the compensation for different viewing distances is only 25% of that required for complete constancy. To a first approximation.
because the increased vergence demand brings all objects in the scene closer to the observer and thus within the abathic region, where the size and depth scaling is closer to veridical. In other words, the results of previous studies of size and disparity scaling lead us to the rather counterintuitive prediction that scaling should be more veridical under telestereoscopic viewing conditions than under normal viewing!
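The geometric bookkeeping behind these predictions can be summarized in a short sketch. This is an illustration of the standard viewing geometry rather than anything taken from the studies discussed here, and the function name, interocular separation, and distances are assumed values: multiplying the effective interocular separation by a factor k raises the vergence demand of an object at distance D to that of an object at distance D/k viewed with the normal separation, so a visual system that scaled size and disparity perfectly would see the whole scene shrunk by the factor 1/k.

import math

def vergence_specified_distance(distance_m, interocular_m, magnification_k):
    # Vergence demand through the telestereoscope for an object straight ahead...
    effective_separation = magnification_k * interocular_m
    vergence = 2.0 * math.atan(effective_separation / (2.0 * distance_m))
    # ...and the distance at which normally spaced eyes would meet that demand.
    return interocular_m / (2.0 * math.tan(vergence / 2.0))

# Assumed values: 6.5 cm interocular separation, doubled baseline (k = 2).
for d in (0.5, 1.0, 2.0, 4.0):
    print(d, "m ->", round(vergence_specified_distance(d, 0.065, 2.0), 3), "m")

With the baseline doubled, every vergence-specified distance is exactly halved; since retinal sizes are unchanged, the sizes and relative depths implied by perfect scaling are halved as well, which is Helmholtz's scaled-down model. Any departure from this prediction can then be attributed to imperfect scaling or to the influence of familiar size, the two factors examined next.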
4.4 Telestereoscopic viewing: two predictions
We now have two testable predictions. First, the telestereoscopic viewing of a completely abstract 3D scene (with no familiarity cues to the sizes or distances of parts of the scene) should lead to the perception of a miniature version of the original scene, in which any departures from the geometric predictions can be attributable only to imperfect size and depth scaling rather than to the miniaturization of the scene. Second, the telestereoscopic viewing of a real 3D scene that contains many cues to the sizes and distances of different objects should lead to the perception of a scaled-down model of the original scene, but the extent of the miniaturization might not be as large as that predicted by the geometry, because familiarity cues might mitigate or override the information provided by vergence and vertical disparities. In the limit, familiarity cues might dominate completely and observers might see no difference in the scale of the telestereoscopic version of the scene compared with normal viewing, despite the presence of conflicting vergence and vertical-disparity cues. What happens in practice? Apart from the informal observations described by Helmholtz and others, there have been few attempts to measure the effects of telestereoscopic viewing quantitatively until a recent paper by Glennerster et al. (2006). Instead of using a conventional telesteroscope, Glennerster et al. used a virtual reality setup to display a simple, rectangular “room” (3 m wide × 3.5 m deep) in which the walls were lined with a brick texture. The virtual room, seen via a head-mounted display, could be expanded or contracted by a factor of two (i.e., made larger or smaller) with respect to its “normal” size. The expansion or contraction of the virtual room was coupled to the observer’s position in the room as s/he walked from side to side across the open end of the room. One can think of this virtual room as a dynamic telestereoscope in which the effective interocular distance was reduced by a factor of four (from ∼13 cm to ∼3.25 cm) as the observer walked from the left-hand side to the right-hand side of the room. In keeping with the cautionary comments made earlier, it is important to stress that when it is claimed that the room “expanded” or “contracted,” this refers to the change in the vergence demand of all features on the walls and floor of the room, as well as to the pattern of vertical disparities. The angular size of
B. Rogers the walls and of the bricks making up the walls did not change, and therefore the observer was presented with conflicting information that the room did not either expand or contract. The authors reported that their five observers “failed to notice when the scene expanded or contracted despite having consistent information about scale from both distance walked and binocular vision.” Unfortunately, the authors asked their observers only about their impressions of the change in size of the room and their judgments of the change in the apparent size of a suspended cube, and therefore it is impossible to extrapolate from these results to say anything about their perception of the room or the extent of constancy under static conditions. For example, it would have been useful to have known whether the observers’ initial impressions of what was depicted as a “half-size room” were of a small room that did not change its size during expansion or whether the room was seen as “normal” in size both before and after the expansion. Moreover, we do not know whether the observers’ failure to notice the size change was a consequence of the gradual change of vergence demand as the observer walked from one side of the room to the other. Would the same result hold for an abrupt “expansion” in which the vergence demand changed abruptly? To be fair to the authors, they did repeat the experiment using an abrupt change between the their “small” and “large” rooms, and they found that “size matching across different distances was better than with the smoothly expanded room,” but unfortunately this manipulation was confounded by a second manipulation in which the room no longer had a floor or ceiling, so we have no conclusive answer to that question either. How does the Glennerster et al. experiment relate to the question of whether the familiarity of the scene plays a role in telestereoscopic viewing? As mentioned previously, the walls of the room were lined with a brick texture, and this may have provided some useful information about the size of the room, assuming that the observers knew something about the usual sizes of bricks. However, if Figure 1 in Glennerster et al. (2006) is a correct depiction of the actual room, the 12 rows of bricks in the “normal” room (height ∼3 m) actually provide information that the room is much smaller than its depicted size of 3 m × 3.5 m. A second cue comes from the fact that the display provided information about the observer’s eye height above the floor that was correct only for the “normal” room. However, since Glennerster et al. reported “an equally strong subjective impression of stability” in a modified version of the experiment in which there was no floor or ceiling, this suggests that eye height information does not play a significant role. The results of the Glennerster et al. study are difficult to interpret in yet another way. The authors asked observers whether they noticed that (i) the
room changed in size and (ii) a “cube” (actually a frontal square, since it had no depth) changed in size as the observer walked from left to right across the “room.” It is not clear, however, what was meant by the question. It could refer to the distal information – did the perceived size of the cube in centimeters change as the observer moved from one side to the other? However, it could also be interpreted as referring to proximal information about angular size. In the Glennerster et al. experiment, there was no change in the angular size of the room or the angular size of the bricks as the observer moved from one side to the other and therefore, in angular terms, there was no size change. In addition, Howard (2008) has made the point that the absence of a change in retinal size is a powerful cue that the distance to a surface has not changed. Hence, if the information provided by this cue was of an unaltered distance and the angular size was also unchanged, it would not be unreasonable for observers to conclude that the actual size of the room had not changed. According to this interpretation, there is simply a conflict between the change in the distance information from vergence demand and vertical disparities, signaling that the “room” had changed in size, and the angular-size information, signaling that the room had not changed in its 3D size. This interpretation is significantly different from Glennerster et al.’s claim that “human vision is powerfully dominated by the assumption that an entire scene does not change size.” The authors weaken their case still further by suggesting that an analogous assumption (i.e., that the entire scene does not change size) “underlies the classic ‘Ames room’ demonstration.” They go on to say, “In that illusion, the two sides of the room have different scales but appear similar because observers fail to notice the gradual change of scale across the spatial extent of the room” (my italics). Even if our visual systems did rely on such an assumption, it would have no bearing on the Ames room demonstration. To reiterate, an Ames room, properly constructed, creates the same pattern of light (optic array) to a monocular observer as would be created by a real rectangular room (Rogers, 2004). There is no “gradual change of scale” to notice!
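To get a feel for the size of the signal that the observers apparently failed to use, it helps to put rough numbers on the vergence-demand change produced by the expanding room. The sketch below assumes simple vergence geometry and a normal interocular distance of about 6.5 cm (so that the quoted effective values of ∼13 cm and ∼3.25 cm correspond to doubling and halving it); the function and the printed values are illustrative and are not taken from Glennerster et al.

```python
import math

# Vergence demand of the far wall of the 3.5 m deep virtual room for three
# effective interocular distances (illustrative; assumes a normal IOD of ~6.5 cm,
# which the quoted ~13 cm and ~3.25 cm values double and halve, respectively).
def vergence_arcmin(iod_cm: float, distance_m: float) -> float:
    return 60 * math.degrees(2 * math.atan((iod_cm / 100) / (2 * distance_m)))

for iod in (13.0, 6.5, 3.25):
    print(f"effective IOD {iod:5.2f} cm -> vergence demand {vergence_arcmin(iod, 3.5):5.1f} arcmin")
# ~128, ~64, and ~32 arcmin: a fourfold change in vergence demand, while the
# angular sizes of the walls and bricks remain exactly the same.
```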
4.5 Four experimental questions
How can we answer these outstanding questions? First, we would like to know whether the familiarity of the scene plays a role in telestereoscopic viewing. To do this, we need to contrast the effects of changing the interocular distance while viewing familiar scenes, on the one hand, and scenes containing no familiarity cues as to their size and distance, on the other. (The brick-textured walls used by Glennerster et al. might be considered to fall somewhere in between these two extremes.) Second, we would like to know if there is a
difference between the effects of a gradual and an abrupt change of interocular separation on the perception and judgments of the scene’s dimensions. Third, it would be useful to know whether any effect on those judgments was due to (i) the change in vergence demand or (ii) the change in the gradient of vertical disparities. As was pointed out previously, changing the interocular distance in a conventional telestereoscope changes both factors, and by the same amount. (It is not clear from the text whether the change in the gradient of the vertical disparities as the observer moved from side to side could be detected by observers in the Glennerster et al. experiment.) Fourth, we need a better way of measuring the perceptual consequences of a change in interocular distance than relying solely on judgments of perceived or matched size, which are susceptible to ambiguity in interpretation, as well as to bias. The following experiment goes some way to answering these questions (Rogers, 2009). First, in order to maximize the possible effects of prior knowledge and familiarity cues, our scenes were actual indoor locations that contained many familiar-sized objects, such as desks, chairs, and books (Figure 4.4). The scenes themselves were known and recognized by all the observers.
Figure 4.4 Pairs of stereo images ((a) and (b)) used in the experiment, which are suitable for crossed fusion.
Figure 4.5 Plan view of a large-field stereoscope used to present disparate images. The images were projected by two LCD projectors onto translucent screens 57 cm from the observer’s eyes and viewed via mirrors at ±45◦ to the straight-ahead direction. The binocular field of view was 70◦ × 55◦.
While this does not allow us to contrast the effects of familiar and unfamiliar scenes in the way we would have liked, it does mean that if scene familiarity works against appropriate scaling by vergence and vertical-disparity cues, as Glennerster et al. believe, the very familiar scenes that we used should be the most likely to show the predicted reduction in the amount of scaling. Second, the displays in our experiment were designed to ensure that there was abundant (and nonconflicting) information about the size of the room from both vergence demand and vertical disparities. Unlike the majority of telestereoscopic devices, which have limited fields of view, our presentation system allowed images of 70◦ × 55◦ to be displayed. To achieve this, separate rear-projected images were displayed on screens 57 cm from the observer’s eyes and viewed via mirrors oriented at ±45◦ to the line of sight (Figure 4.5). Previous results have shown that observers can use the vertical-disparity information presented under these conditions (Rogers and Bradshaw, 1993, 1995; Bradshaw et al., 1996).
4.6 Methods and procedure
The images used in the experiment were captured using a high-resolution digital camera with a distortion-free, wide-angle lens. To capture the images, the camera was positioned so that the “eye height” (120 cm) was identical to that of the observer seated in our stereoscope, and the sizes of the projected images were adjusted so that all visual angles were identical to those of the real scene. (The only inconsistency in the reproduction of the original scenes was in the absence of appropriate accommodation cues in our large-field stereoscope, since the images were displayed on screens at a constant accommodation distance of 57 cm from the observer.) As a consequence, the optic arrays at the observer’s eyes were identical to those created by the real 3D scene. To mimic the effects of telestereoscopic viewing, pairs of stereoscopic images were captured using
camera separations of 3.1 cm (one half of normal interocular distance), 6.2 cm (normal interocular distance), 12.4 cm (double interocular distance), and 24.8 cm (quadruple interocular distance). Geometrically, these stereo pairs precisely mimicked the viewing of a double-sized scene (with normal interocular distance), a normal-sized scene, a half-sized scene, and a quarter-sized scene. Hence the size ratio of the largest-sized to the smallest-sized depicted scene was 8:1. The first scene (Figure 4.4a) was chosen as a richly furnished version of Glennerster et al.’s 3 m × 3.5 m plain brick room in which the far wall lay in a more-or-less frontal plane, while the second scene (Figure 4.4b) was an oblique view of a more extended 3D environment. To familiarize the observers with the tasks that they would be asked to do under telestereoscopic viewing conditions, observers were first shown the stereo image pairs taken with a camera separation of 6.2 cm in order to reproduce the normal viewing of the two different 3D scenes. Observers were asked first to describe the sizes and distances of the two scenes and their contents. In particular, they were asked about the height of the ceiling, the distance to the floor, and the dimensions of key objects in the scene using whatever units they were most familiar with. Observers were asked similar questions in the main experiment, in which a randomized set of stereo pairs, derived from the four different camera separations, was presented. Viewing time was unlimited. Although the observers were explicitly asked to comment on the apparent sizes and distances, they were never given any expectation that the depicted scenes might appear to be different in scale or dimensions in different trials. Although the data obtained from observers’ estimates of size and distance can be useful, we know that such estimates are susceptible to bias. In addition, observers are not always good at giving reliable and consistent estimates. As a consequence, we also asked observers to make 3D shape judgments (Johnston, 1991; Glennerster et al., 1996). In particular, we asked observers to vary the disparity-specified slant angle of a horizontally oriented ridge surface covered with random texture elements until the ridge angle appeared to be 90◦. Not only do observers find this task straightforward but there is also no way that observers could guess the correct answer. Using this technique, we have a much more objective measure of the extent to which stereoscopic information (vergence demand and vertical-disparity gradients) is either utilized or “ignored” in telestereoscopic viewing (Glennerster et al., 2006).
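Anticipating the geometry discussed in the next section, the mapping from camera separation to the depicted scale of the scene can be made explicit in a few lines. This is a minimal sketch assuming simple vergence geometry and treating 6.2 cm as the normal interocular distance; the function and variable names are mine, not part of the experimental software.

```python
# Geometric scale of the depicted scene for each camera separation, assuming that
# vergence demand is proportional to separation/distance and is interpreted with a
# normal interocular distance of 6.2 cm (illustrative sketch only).
NORMAL_IOD_CM = 6.2

def depicted_scale(camera_separation_cm: float) -> float:
    """Size of the depicted scene relative to the real scene."""
    return NORMAL_IOD_CM / camera_separation_cm

for sep_cm in (3.1, 6.2, 12.4, 24.8):
    print(f"camera separation {sep_cm:>4.1f} cm -> depicted scene x {depicted_scale(sep_cm):.2f}")
# 3.1 cm -> x 2.00 (double-sized), 6.2 cm -> x 1.00 (normal),
# 12.4 cm -> x 0.50 (half-sized), 24.8 cm -> x 0.25 (quarter-sized): an 8:1 range.
```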
4.7 The geometry of telestereoscopic viewing
Let us return to the geometry of the situation in order to predict what should happen to the 3D shape judgments. Doubling the interocular separation
has no effect on the angular size of objects in the scene but it increases the vergence demand by a factor of two, and hence it halves the depicted distance to all parts of the scene. This means that if retinal size is scaled completely (100%) by the vergence distance, the perceived size should exactly halve. Doubling the interocular separation also doubles retinal disparities, but because the disparities created by any 3D object vary inversely with the square of the distance (Howard and Rogers, 1995), perceived depth should halve if scaling is 100%. If scaled size in an x–y direction is halved and scaled depth in a z direction is halved, perceived 3D shape should be unaffected by telestereoscopic viewing. In other words, with perfect scaling, observers should perceive a miniature version of the original scene with perfect shape constancy, as Helmholtz originally claimed. On the other hand, if size and/or disparity scaling is incomplete, perceived shape should be affected. As a consequence, asking observers about the shape of a 3D object such as the ridge surface used in our experiment should provide a bias-free measure of the degree of 3D shape constancy under telestereoscopic viewing. An alternative way of making the same point is to consider the disparity gradients created by the ridge surface. Doubling the interocular separation has the geometric effect of doubling the disparity gradient of each flank of the ridge surface but it also increases vergence demand by a factor of two, and hence the depicted distance is halved (to a first approximation). Since the disparity gradient of a surface with a particular 3D slant varies inversely with the viewing distance (Howard and Rogers, 1995), the perceived slant should be unaffected by the increased interocular separation if disparity gradient scaling is 100%. Previous results have shown that disparity gradient scaling is, in reality, less than 100%. As a consequence, slanted surfaces at far viewing distances (> 150 cm) are generally perceived to have a shallower slant than would be predicted by the geometry, while slanted surfaces at close viewing distances (< 50 cm) are generally perceived to have a steeper slant than would be predicted (Bradshaw et al., 1996). Two different procedures were used to measure the constancy of 3D ridge surfaces under telestereoscopic viewing. In the first, a vertically oriented 90◦ ridge surface covered with a random texture pattern was introduced into the 3D scene (Figure 4.6). Observers were asked to judge the ridge angle (in degrees) in each of the four conditions taken with different camera separations. Note that because this was a real 3D object in the scene, the disparity gradient of the ridge surface in the stereo pair taken with a camera separation of 3.1 cm (half normal interocular distance) was half that with normal viewing. The disparity gradients of the ridge surface in the stereo pairs taken with camera separations of 12.4 cm (double normal interocular distance) and 24.8 cm (quadruple normal interocular distance) were double and quadruple, respectively, those of normal viewing.
Figure 4.6 A magnified portion of one of the two 3D scenes used in the experiment, showing the real 90◦ ridge surface on the right alongside the LCD monitor on the left that was used to display a simulated, adjustable ridge surface.
However, since the vergence demand of the stereo pair taken with a camera separation of 3.1 cm was halved (and the depicted distance therefore doubled), the 90◦ ridge surface should appear to have the same ridge angle if disparity gradient scaling is 100%. In other words, the geometry of telestereoscopic viewing predicts that the shapes of any 3D objects should remain the same. A corollary of this statement is that any slight departure from a perception of a 90◦ ridge should provide reliable evidence of incomplete scaling. A second, adjustment procedure was used to measure the extent of constancy under different telestereoscopic viewing conditions. While it might be difficult for observers to accurately report the angle of a particular ridge surface in degrees, most observers are capable of adjusting the angle of a ridge surface until it appears to be 90◦ (Glennerster et al., 1996). To create an adjustable 3D ridge surface in our naturalistic scenes, one of the objects in the scene was an LCD monitor screen that stood on a table to the right of the scene at a distance of 114 cm (Figures 4.4a and b). No images were ever displayed on the monitor screen when it was photographed with the different camera separations but, using Photoshop, the appropriate disparate images of a horizontally oriented, random-textured ridge surface were superimposed on the dark regions of the screens (Figure 4.6). Multiple, identical stereopairs of the same real-world scene, for each of the four different camera separations, were created, and a textured ridge surface with a different disparity gradient was superimposed on
each of the pairs. The complete set of ridge surfaces had disparity gradients that ranged from one half of the disparity gradient created by a 90◦ ridge surface in the “double-sized” room (camera separation halved) to double the disparity gradient created by a 90◦ ridge surface in the “quarter-sized” room (camera separation × 4). The observer’s task was to use the up–down cursor keys on a keyboard to select the stereo pair in which the ridge surface superimposed on the monitor screen appeared to be 90◦. Note that, for any particular trial, the stereo images of the rest of the scene were identical (and corresponded to one of those captured with the four different camera separations) – only the disparity gradient of the superimposed ridge surface was changed. It is also important to note that if disparity and size scaling were 100%, the disparity gradient of the surface corresponding to a 90◦ ridge would have to be very different for each different-sized scene. For example, the disparity gradient of a 90◦ ridge surface set in a stereo pair with a camera separation of 12.4 cm (double normal interocular distance) would have to be double that for a stereo pair with a camera separation of 6.2 cm (normal interocular distance), since the depicted distance of the LCD monitor and the ridge surface was halved. Hence observers could have no way of knowing which was the appropriate stereo pair to choose. The corollary of this statement is that if observers “ignored stereo cues” (Glennerster et al., 2006) and there was no scaling of disparities and disparity gradients, observers should choose a ridge surface with the same disparity gradient regardless of the scale of the depicted scene. There were seven observers (including the author), six of whom were naive to the purpose of the experiment.
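Before turning to the results, the geometric predictions of this section can be summarized compactly. The derivation below is a sketch using small-angle approximations; the symbols are introduced here purely for exposition.

```latex
% Telestereoscopic scaling with full (100%) constancy, small-angle approximation.
% I = camera (or effective interocular) separation, I_0 = the observer's own
% interocular distance, D = physical distance, \Delta d = physical depth interval.
\begin{align*}
  \theta    &\approx \frac{I}{D}                                   &&\text{(vergence demand)}\\
  D'        &= \frac{I_0}{\theta} = \frac{I_0}{I}\,D                &&\text{(depicted distance)}\\
  \delta    &\approx \frac{I\,\Delta d}{D^{2}}                      &&\text{(relative disparity of the depth interval)}\\
  \Delta d' &= \frac{\delta\,D'^{2}}{I_0} = \frac{I_0}{I}\,\Delta d &&\text{(perceived depth with full scaling)}\\
  s'        &= \alpha\,D' = \frac{I_0}{I}\,s                        &&\text{(perceived size, for unchanged angular size } \alpha\text{)}
\end{align*}
% Hence \Delta d'/s' = \Delta d/s: perceived 3D shape is preserved, while the whole
% scene is scaled by I_0/I (1/2 for doubled separation, 1/4 for quadrupled).
```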
4.8 Results
After an initial exposure to stereopairs depicting the “normal-size” room, observers were presented with a randomized sequence of 40 trials, in which they saw each of the two different scenes under each of the four different magnifications/minifications (five repetitions for each condition). In each trial, they were asked to describe what they saw. All seven observers spontaneously reported differences in the size of objects and the scale of the scene. For the scenes taken with double and quadruple camera separations, which geometrically specify a miniature, scaled-down room, observers spontaneously gave reports such as “it looks like a model room,” “the ceiling is very low and the floor is raised,” “it looks squashed, everything smaller and closer,” and “the monitor appears very close.” These reports contrast with those of the observers in the Glennerster et al. experiment using a virtual reality setup, in which it was found that “none of the subjects we tested noticed that there had been a change in size of the room.”
In the first of the two procedures for measuring the extent of disparity scaling, we presented a real, vertically oriented ridge surface (Figure 4.6) and observers were asked to comment on the perceived angle of the ridge. Not surprisingly, observers reported that the physical ridge appeared to have an approximately 90◦ ridge angle under “normal” (6.2 cm camera separation) viewing conditions. As mentioned previously, if disparity gradient scaling were 100%, observers should also perceive the physical 90◦ ridge to be 90◦ under all telestereoscopic viewing conditions, although the ridge surface itself should also appear to be either smaller and closer or larger and farther away. On the other hand, if there were no scaling at all, observers should perceive the ridge to have double the slant (i.e., a steeper ridge) with double the interocular distance and half the slant with half the interocular distance, since the disparity gradients are doubled and halved, respectively, in these two conditions. Our observers reported that there were small differences in the perceived angle of the “real” ridge surface. When the interocular separation was doubled or quadrupled, observers thought that the ridge angle was slightly steeper than 90◦, and slightly shallower than 90◦ when the interocular separation was halved. This result suggests that there was a fairly high degree of disparity scaling over the 8:1 range of telestereoscopic viewing conditions, but it should be remembered that these observations were necessarily subjective and susceptible to bias. The results of the ridge adjustment task provided quantitative evidence of the amount of scaling. Figure 4.7 shows the expected patterns of results for the two extreme possibilities: 100% constancy and 0% constancy. On the abscissa is the independent variable: the particular telestereoscopic viewing conditions, expressed in terms of the vergence demand required by the simulated ridge surface. Under normal viewing conditions (camera separation = 6.2 cm), the vergence demand created by the LCD monitor was 186 arcmin, since the monitor was 114 cm from the observer. Doubling the camera separation to 12.4 cm meant that the vergence demand doubled to 371 arcmin, or a depicted distance of 57 cm from the observer, and so on. Note that the abscissa scale is linear in terms of vergence angle, since this was what was manipulated, but nonlinear in terms of viewing distance (top scale). The ordinate shows the degree of scaling expressed as an “effective vergence angle.” From the geometry, we know the relationship between the vergence angle (or distance) and the disparity gradient created by a 90◦ ridge surface. Hence, if an observer selects a particular surface as appearing to have a 90◦ ridge angle, the disparity gradient can be expressed in terms of the vergence angle (or distance) at which that particular ridge surface, with its particular disparity gradient, would correspond geometrically to 90◦. Thus if an
Figure 4.7 The predicted results of the experiment if there was (i) 100% constancy scaling (45◦ line) or (ii) 0% scaling (horizontal line). The results are plotted in terms of the “effective vergence angle” (ordinate) against the “simulated vergence angle” (abscissa). (IOD, interocular distance.)
observer always selected a ridge surface with a disparity gradient that corresponded precisely to the disparity gradient created by a real 90◦ ridge at the simulated vergence angle of the monitor (100% scaling), the “effective vergence angle” would be the same as the simulated vergence angle. If this were the case, the data points for the four different telestereoscopic conditions would all fall along the positive diagonal. On the other hand, if an observer always chose a ridge surface with the same disparity gradient for each of the four different telestereoscopic conditions (0% scaling), the effective vergence angle would always be the same and the data points would all fall along a horizontal line. Expressing the data in this way allows us to see clearly the extent to which 3D shape scaling is complete, since it corresponds to the slope of the best-fitting straight line through the data points. Figure 4.8 shows the results from one observer (the author) for the two different scenes. These data show clearly that there is substantial scaling of 3D shape. The slope of the best-fitting straight lines through the four data points can be used as a measure of the degree of constancy, which, for the
Figure 4.8 The results for one observer (BJR), showing the extent of scaling as a function of the four different telestereoscopic viewing conditions for the two different 3D scenes ((a) and (b)). The 90◦ ridge settings are expressed as “effective vergence angles” (ordinate) plotted against the simulated vergence angle (abscissa).
two different scenes, corresponds to 64% and 58%, respectively. If we consider only the degree of constancy over the fourfold change of size from one-half camera separation to double camera separation (as used in the Glennerster et al. experiment), the degree of constancy rises to 75% and 80% for the two different scenes. The fact that these three data points lie below the positive diagonal for both scenes shows that the 3D shape was underestimated (i.e., a 90◦ ridge surface looks slightly flattened) when the LCD monitor was depicted at distances of 57 cm and beyond, which is consistent with the findings of previous experiments (Bradshaw et al., 1996; Glennerster et al., 1996). In addition, there was a small degree of overconstancy when the camera separation was increased to 24.8 cm (when the monitor was depicted at a 28 cm viewing distance), which again fits with previous findings (Glennerster et al., 1998). The pattern of results found for the one nonnaive observer was replicated for the other six observers (Figure 4.9). For all observers, the degree of constancy was highest in the region where the camera separation was varied between half and double (average slopes = 88% and 79% for the two different 3D scenes) and fell off when the camera separation was quadrupled (four times interocular distance). In addition, it is very clear that the main trend in the individual differences is in shifting the curves up or down, while the lines joining the points remain roughly parallel. A likely reason for this difference is that different observers may have had a different criterion for what constitutes a 90◦ ridge, and this bias is reflected in all their settings. It is also interesting that there are only small (and nonsignificant) differences between the data from the two different visual scenes. Changing the scene from a frontally viewed rectangular room (Figure 4.4a) to an oblique view of a much more extended scene (Figure 4.4b) appeared to have little or no effect on the pattern of results.
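The analysis behind Figures 4.7–4.9 can be illustrated with a short sketch that converts each camera separation into the simulated vergence angle of the monitor and then expresses the degree of constancy as the slope of effective against simulated vergence angle. The "effective" values below are invented purely for illustration (they are not the measured settings), and the function and variable names are mine.

```python
import numpy as np

# Simulated vergence angle (arcmin) of the LCD monitor (physically 114 cm away) for
# each camera separation, assuming simple vergence geometry (illustrative sketch).
MONITOR_CM = 114.0

def vergence_arcmin(separation_cm, distance_cm=MONITOR_CM):
    return np.degrees(2 * np.arctan(separation_cm / (2 * distance_cm))) * 60

separations = np.array([3.1, 6.2, 12.4, 24.8])        # cm
simulated = vergence_arcmin(separations)               # approx. 93, 187, 374, 745 arcmin
                                                       # (close to the 186 and 371 arcmin values quoted above)

# Hypothetical "effective vergence angle" settings: the vergence at which each chosen
# disparity gradient would correspond to a true 90 deg ridge (illustration only).
effective = np.array([140.0, 210.0, 340.0, 560.0])

constancy = np.polyfit(simulated, effective, 1)[0]     # slope of the best-fitting line
print(f"degree of constancy ~ {100 * constancy:.0f}%") # 100% = perfect scaling, 0% = none
```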
4.9 Summary of results
The results of our experiment reveal a high degree of constancy scaling with telestereoscopic viewing over an eightfold change in the effective interocular separation of the eyes. This was apparent both from our observers’ subjective reports of the distances and sizes of objects in the scene and from their systematic judgments of 3D shape. Observers could reliably judge the decreased or increased size and distance of objects in telestereoscopically viewed scenes despite the fact that the scenes contained many familiar objects of known size. We have no way of knowing (at present) whether the extent of constancy would be even higher if all cues of familiarity were eliminated but, nevertheless, we can make some comparisons of our results with the degree of constancy found in previous experiments when viewing distance alone was manipulated. These
Figure 4.9 The results for all seven observers, showing the extent of scaling as a function of the four different telestereoscopic viewing conditions (abscissa) for the two different 3D scenes ((a) and (b)).
comparisons are important, since we have assumed so far that any failure to correctly perceive the scaled-down (or scaled-up) size of objects under telestereoscopic viewing is due to the fact that the scene is altered in size compared with the original scene. Hence, knowledge about the actual size of objects in the scene, and thus their distances, would always work against an altered vergence demand that signaled an inconsistent and conflicting distance. There is, however, an alternative possibility. Telestereoscopic viewing also brings all objects closer to (or farther from) the observer. Unless size and disparity scaling fully compensates for the changing viewing distance (100% constancy), it is possible that any failure to correctly judge the size and 3D shape of objects in the scene under telestereoscopic viewing is due to the distances at which the objects are depicted rather than to telestereoscopic viewing per se. To answer this question, we need to look at the accuracy of 3D shape judgments at different viewing distances. Glennerster et al. (1996) used a ridge surface task similar to the one used in the present experiment to measure the veridicality of 3D shape judgments at different distances. Across a wide range of viewing distances, from 38 to 228 cm, they found systematic departures from veridicality. At close viewing distances (< 70 cm), they found that the gradients of a triangular-wave surface in depth (multiple ridges) were overestimated (they appeared steeper than they should have done for the depicted disparity gradients), while at far viewing distances (> 100 cm), the gradients were systematically underestimated (they appeared more flattened than they should have done for the depicted disparity gradients). This is a pattern of results very similar to that found in the present experiment (Figure 4.9). Moreover, the degree of constancy found in the Glennerster et al. (1996) experiment was around 75% for their three observers, which is remarkably similar to the figure found in the present experiment. This comparison is important because it suggests that all of the failure to achieve 100% constancy in the telestereoscopic viewing experiment reported here was due to the 3D surface being presented at different depicted viewing distances. The corollary of this statement is that none of that failure was due to the altered sizes of familiar objects created by the telestereoscopic viewing situation. As a consequence, we find no evidence that “stereo cues are ignored” but, rather, our findings appear to support Helmholtz’s claim that with telestereoscopic viewing, natural scenes appear like “a very exquisite and exact model.”
4.10 Reconciling the conflicting results
Why should our results and conclusions be so different from those of Glennerster et al. (2006)? Several factors may be responsible. First, it is not clear
that the changes in the vertical-disparity gradients presented by the brick wall images in Glennerster et al.’s experiment could be resolved by the observers (the pixel size was 3.4 arcmin and there was only 32◦ binocular overlap), and this may have led to a small reduction in the amount of scaling for the differently depicted room sizes (Rogers and Bradshaw, 1995; Bradshaw et al., 1996). Second, Glennerster et al. investigated whether observers noticed a change in the size of the room in a situation where they varied the interocular distance continuously. In the experiment reported here, we looked only at 3D shape judgments under static telestereoscopic viewing conditions. It is possible that the human visual system is relatively insensitive to the changes in vergence demand that accompany a continuous change in interocular distance. For example, Howard (2008) has provided good evidence that the absence of a change in angular size provides a powerful source of information that there is no change in the distance of objects. Informally, we have tried changing the camera separation gradually using our real-world scenes. For the real 3D ridge surface in the scene (Figure 4.6), the disparity gradient changes in direct proportion to the changes in the camera separation, but observers did not perceive any change in perceived slant, presumably because the vergence and vertical-disparity cues provided adequate information about the change in distance. Conversely, the perceived slant of the simulated ridge surface on the LCD monitor did change with gradual changes in the camera separation, even though the disparity gradient of the ridge surface remained the same. This suggests that the gradualness of the change in the interocular separation (and hence vergence demand) did not preclude the appropriate scaling of the disparity-specified ridge surfaces. However, although the 3D shape judgments appeared to follow the gradual changes in vergence demand, observers did not always immediately perceive the change in the scale of the depicted scene. This suggests that there may be a dissociation between the scaling of size and binocular disparities from vergence and vertical-disparity cues and the effectiveness of these cues as a basis for judging distance. Finally, previous research has shown that judgments of size and distance are more difficult to make and more susceptible to bias when the object or surface is suspended in space, as was the case in the Glennerster et al. experiment. In the present experiment, as with Helmholtz’s original observations, there was a visible ground plane surface and the objects used for the judgments of size and 3D shape were all supported by surfaces clearly attached to one another and to the ground plane. These differences would not account for the failure of Glennerster et al.’s observers to perceive the changes in the depicted size of the room as a whole, but they might account for some of the failures of constancy in the matching task.
4.11 Conclusions
In conclusion, it is not obvious whether there is a satisfactory way of distinguishing between those perceptions that should be classified as illusory and those that should be classified as veridical. All of our perceptions depend on the particular characteristics of the underlying mechanisms and it seems arbitrary to classify some perceptions, such as thresholds or metameric color matches, as “just the way the system works” while others are given the special label of an “illusion” (Rogers, 2010). What is important is to specify the information available in any particular situation rather than to describe how the particular pattern(s) of light reaching the eye(s) was/were created. For example, it is surely irrelevant that the Ames room is trapezoidal if there is no information about its trapezoidal shape. Telestereoscopic viewing of the world provides an interesting case study of a situation in which there is inconsistent and conflicting information about the scale of the scene, with the vergence demand and the vertical-disparity gradients indicating a scaled-down, miniature world while the size-of-familiar-objects cues signal the actual distance and size of the objects. The results of our experiment provide support for Helmholtz’s claim that we perceive “a very exquisite and exact model” under telestereoscopic viewing conditions and that the presence of familiar-sized objects appears to have little influence on what we perceive. Interestingly, this is also what we can conclude about the Ames room demonstration – the presence of familiar-sized objects has little influence on what we perceive.
References
Bradshaw, M. F., Glennerster, A., and Rogers, B. J. (1996). The effect of display size on disparity scaling from differential perspective and vergence cues. Vis. Res., 36: 1255–1264.
Bülthoff, H. H. and Mallot, H. A. (1987). Interaction of different modules in depth perception. In Proceedings of the IEEE First International Conference on Computer Vision, pp. 295–305. Los Alamitos, CA: IEEE Computer Society Press.
Foley, J. M. (1980). Binocular distance perception. Psychol. Rev., 87: 411–434.
Gillam, B. and Lawergren, B. (1983). The induced effect, vertical disparity, and stereoscopic theory. Percept. Psychophys., 34: 121–130.
Glennerster, A., Rogers, B. J., and Bradshaw, M. F. (1996). Stereoscopic depth constancy depends on the subject’s task. Vis. Res., 36: 3441–3456.
Glennerster, A., Rogers, B. J., and Bradshaw, M. F. (1998). Cues to viewing distance for stereoscopic depth constancy. Perception, 27: 1357–1365.
Glennerster, A., Tcheang, L., Gilson, S. J., Fitzgibbon, A. W., and Parker, A. J. (2006). Humans ignore motion and stereo cues in favor of a fictional stable world. Curr. Biol., 16(4): 428–432.
Gogel, W. C. (1977). An indirect measure of perceived distance from oculomotor cues. Percept. Psychophys., 21: 3–11.
Helmholtz, H. von (1910). Physiological Optics. New York: Dover (1962 English translation by J. P. C. Southall from the 3rd German edition of Handbuch der Physiologischen Optik, Hamburg: Voss).
Howard, I. P. (2008). Exploring motion in depth with the dichoptoscope. Perception, 37 (suppl.): 1.
Howard, I. P. and Rogers, B. J. (1995). Binocular Vision and Stereopsis. New York: Oxford University Press.
Howard, I. P. and Rogers, B. J. (2002). Seeing in Depth, Vol. 2. Toronto: I. Porteous.
Johnston, E. B. (1991). Systematic distortions of shape from stereopsis. Vis. Res., 31: 1351–1360.
Judge, S. J. and Bradford, C. M. (1988). Adaptation to telestereoscopic viewing distance measured by one-handed ball catching performance. Perception, 17: 783–802.
Judge, S. J. and Miles, F. A. (1985). Changes in the coupling between accommodation and vergence eye movements induced in human subjects by altering the effective interocular distance. Perception, 14: 617–629.
Maloney, L. T. and Landy, M. S. (1989). A statistical framework for robust fusion of depth information. In Visual Communication and Image Processing IV, Proc. SPIE, 1199: 1154–1163.
Rogers, B. J. (2004). Stereopsis. In R. L. Gregory (ed.), Oxford Companion to the Mind, pp. 878–881. Oxford: Oxford University Press.
Rogers, B. J. (2009). Are stereoscopic cues ignored in telestereoscopic viewing? J. Vis., 9(8): 288.
Rogers, B. J. (2010). Stimuli, information and the concept of illusion. Perception, 39: 285–288.
Rogers, B. J. and Bradshaw, M. F. (1993). Vertical disparities, differential perspective and binocular stereopsis. Nature, 361: 253–255.
Rogers, B. J. and Bradshaw, M. F. (1995). Disparity scaling and the perception of fronto-parallel surfaces. Perception, 24: 155–179.
Wade, N. J. (2005). Perception and Illusion: Historical Perspectives. New York: Springer.
5 The role of disparity interactions in perception of the 3D environment
Christopher W. Tyler
5.1 Introduction
In understanding visual processing, it is important to establish not only the local response properties for elements in the visual field, but also the scope of neural interactions when two or more elements are present at different locations in the field. Since the original report by Polat and Sagi (1993), the presence of interactions in the two-dimensional (2D) field has become well established by threshold measures (Polat and Tyler, 1999; Chen and Tyler, 1999, 2001, 2008; Levi et al., 2002). A large array of other studies have also looked at such interactions with suprathreshold paradigms (e.g., Field et al., 1993; Hess et al., 2003). The basic story from both kinds of studies is that there are facilitatory effects between oriented elements that are collinear with an oriented test target and inhibitory effects elsewhere in the 2D spatial domain of interaction (although the detectability of a contrast increment on a Gabor pedestal also reveals strong collinear masking effects). The present work extends this question to the third dimension of visual space as specified by binocular disparity, asking both what interactions are present through the disparity dimension and how these interactions vary with the spatial location of the disparate targets. Answering these questions is basic to the understanding of the visual processing of the 3D environment in which we find ourselves. Pace Plato, the world is not a 2D screen onto which images are projected, but has the full extra dimensionality of a 3D space through which we have to navigate and manipulate the objects and substances of physical reality.
It is worth emphasizing that the addition of the third, z, axis to the two axes of x, y space does not just add another plane to the initial x, y plane, but extends its dimensionality in a multiplicative fashion to vastly expand the scope of possible interactions. Thus, whereas there are two dimensions of pointwise interactions for a single dimension of n points (i.e., the effect on the detectability of each x point of stimuli at all other x points), there are four dimensions of possible interactions in 2D space (x, y × x, y) and six dimensions of possible interactions in 3D space (x, y, z × x, y, z) that need to be evaluated in order to have a complete understanding of the visual processing of that space. Moreover, we cannot assume that visual space is homogeneous in its properties, as is classically assumed for physical space. The space of interactions is itself keyed to the absolute location of the targets in the space, adding an additional dimension of possible variations. The question of disparity specificity was addressed in reverse with a suprathreshold paradigm by Hess et al. (1997), who measured the secondary effect of relative disparity on the spatial interactions among targets at different spatial locations. Here, the issue is the direct question of the disparity-domain interactions between elements at different disparities and the same spatial location. The issue then generalizes to the six-dimensional space of the disparity, the x, y location, the orientation, the size, and the luminance polarity of a pair of targets. In each case, the interaction between two elements forms one relevant dimension, with the variation of this interaction over the range of absolute values as a second dimension. The total parameter space for pairwise interactions in 3D space is thus 12-dimensional. To keep the problem manageable, the enquiry is restricted to the horizontal (x) direction of space, focusing on the dimensions of absolute and relative disparity, absolute size, and relative x location and relative polarity. A note on the concept of absolute and relative disparity: one should distinguish between the concepts of “binocular disparity” and “convergence angle.” Thus, zero disparity for the absolute-disparity (z) reference frame is assumed to be the angle between the fixation lines when the eyes are converged on the fixation point. (Zero convergence angle is given by convergence on optical infinity, but its use as the zero for absolute disparity would require the awkward concept that targets with negative, or uncrossed, disparities relative to this convergence angle would have to be seen as more than an infinite distance away.) The test targets are set laterally from the fixation target either at the same (zero) disparity or at a range of different (absolute) disparities. The concept of relative disparity (z) comes into play when there are two targets in the visual field (other than the fixation point). The relative disparity is the difference between the absolute disparities of these targets. Also, for simplicity, it is assumed that
the horopter of zero absolute disparity through space is a horizontal line in the frontoparallel plane extending from the fixation point, which is a reasonable approximation within a few degrees of the fixation point (Ogle, 1950; Tyler, 1991). A natural approach to the study of interactions is the masking paradigm, in which a salient masking stimulus is used to reduce the visibility of a second test stimulus. The presence of masking always implies a processing nonlinearity, because a linear system simply adds the mask to the test, without affecting the response to the test or its signal-to-noise ratio. There are two conceptual frameworks in which masking may be interpreted. Under the assumption of univariance, the mask and test are assumed to impact the same coding channel in an indistinguishable fashion, exploiting some processing nonlinearity to induce a variation in sensitivity to the test stimulus in the presence of the mask (Mansfield and Parker, 1993). In a complex system, however, the mask and test may be processed by different channels with inhibitory interactions between them, as in the classic case of metacontrast masking (Foley, 1994). Such an inhibitory-masking interpretation does not imply univariance or that the data measure the sensitivity of the processing channels for either stimulus; instead, the data measure the inhibitory relations between them. Distinguishing between the univariant-channel masking and the inhibitory-interaction interpretation is not possible without additional information about the characteristics of the system. However, one strong criterion that can be employed is that univariant channel masking may generally be expected to decrease with distance between the test and mask along any stimulus dimension (such as position, disparity, or spatial frequency). If masking increases with distance, this implies an implausible channel structure, and the more likely interpretation is in terms of inhibitory interactions, as in the present treatment. The primary stage of stereoscopic processing may be considered as a local cross-correlation process, performed by neurons tuned to different disparities, occurring at each location in the binocular visual field (Stevenson et al., 1992). The best match or correlation in each local region of the visual field specifies when the local images are in register (Figure 5.1). As the eyes vary their vergence, the cortical projections of the visual scene slide over one another to vary their projected shifts (or disparities), which may be termed the Keplerian array of the binocular projection field. The 2D spatial structure of these local elements is elongated vertically for human stereopsis (Chen and Tyler, 2006). In practice there are, of course, both vertical and horizontal dimensions of field location, and the physiological array may not be as regular as in this idealized depiction.
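As a concrete, if highly simplified, illustration of the local cross-correlation idea, the toy sketch below finds the best-matching disparity for a one-dimensional patch by correlating the left-eye signal with shifted versions of the right-eye signal. It is purely illustrative (integer disparities, 1D signals, arbitrary parameters) and is not the procedure used in the experiments reported here.

```python
import numpy as np

def local_disparity(left: np.ndarray, right: np.ndarray, center: int,
                    half_width: int = 8, max_disp: int = 10) -> int:
    """Integer disparity (in samples) giving the peak normalized local correlation."""
    patch_l = left[center - half_width:center + half_width + 1]
    patch_l = patch_l - patch_l.mean()
    scores = []
    for d in range(-max_disp, max_disp + 1):               # candidate disparities
        lo = center - half_width + d
        patch_r = right[lo:lo + 2 * half_width + 1]
        patch_r = patch_r - patch_r.mean()
        denom = patch_l.std() * patch_r.std() + 1e-12
        scores.append(float((patch_l * patch_r).mean() / denom))  # normalized correlation
    return int(np.argmax(scores)) - max_disp

# Toy example: the right-eye signal is the left-eye signal shifted by 3 samples.
rng = np.random.default_rng(0)
left = rng.standard_normal(200)
right = np.roll(left, 3)
print(local_disparity(left, right, center=100))            # -> 3
```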
5.2 Global interactions
Beyond the disparity registration stage described up to this point are the global interactions operating between the local-disparity nodes, which serve to refine the representation of the disparity image from its initial crude array of stimulated disparities to a coherent representation of the 3D surfaces present in the field of view (Tyler, 2005). A variety of such cooperative processes has been proposed by Julesz and others over the years, summarized in Julesz (1971, 1978) and Tyler (1983, 1991). There are two obvious types of mutual interaction between the disparity-selective signals for different regions of the stereo image: disparity-specific inhibition (along the vertical line in Figure 5.1) and disparity-specific pooling or facilitation (along the horizontal line in Figure 5.1). Each type may, in principle, operate anisotropically to different extents in the three-dimensional space of relationships through the Keplerian array: over the frontal plane, over disparities, or over some combination of the two. Such cooperativity among local-disparity mechanisms may be involved in solving the correspondence problem effectively (Tyler, 1975); it may also include such
Figure 5.1 A Keplerian array of disparity detectors depicted by the intersections of the (oblique) lines of sight of the two eyes (see Kepler, 1611). The retinas of the left and right eyes are schematized as linear arrays, with one point in each array (open circles) indicated as the recipient of the image of the stimulus point in space (filled circle). The solid horizontal line depicts the horopter (x axis) and the solid vertical line the z axis; the dashed line indicates a plane at constant uncrossed disparity.
processes as the disparity gradient limitation on the upper limit for depth reconstruction (Tyler, 1973), and coarse-to-fine matching processes for building up the depth image from the monocular information (Marr, 1982). These processes all may be conceived as taking place within the locus of global interactions following the interocular matching or disparity registration stage (but preceding the generation of a unified global depth image from the plethora of available disparity information (Tyler, 1983, 1991)). The present work was designed to provide an initial survey of the scope of such disparity-domain interactions as a function of location, testing disparity, spatial frequency, and luminance polarity.
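One way to picture where these two types of interaction operate is to apply them to an array of registration-stage responses indexed by position and disparity. The sketch below is a toy illustration only: the weights, extents, and update rule are arbitrary, and it is not the cooperative model proposed in the literature cited above.

```python
import numpy as np

def cooperative_pass(R: np.ndarray, w_pool: float = 0.2, w_inhib: float = 0.3) -> np.ndarray:
    """One illustrative pass of pooling within a disparity plane and inhibition across
    disparities at the same position, applied to R[position, disparity]."""
    pooled = np.zeros_like(R)
    pooled[1:-1, :] = R[:-2, :] + R[2:, :]          # neighbors at the SAME disparity
    rivals = R.sum(axis=1, keepdims=True) - R        # OTHER disparities at the same position
    return R + w_pool * pooled - w_inhib * rivals

R = np.random.default_rng(1).random((64, 21))        # 64 positions x 21 candidate disparities
R_refined = cooperative_pass(R)
```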
5.3 Local target structure
To maximize the local specificity of the probe, the stimuli in the experiments described here were simultaneously local with respect to four variables: eccentricity, extent, disparity, and spatial frequency. Previous studies have confounded these variables so that it was not possible to disambiguate the effects of retinal inhomogeneity from eccentricity, relative peak positions of the test and mask, and spatial-frequency effects. To avoid such phase artifacts and to maximize the masking effects, we adopted a local-stimulus paradigm based simply on Gaussian bar test stimuli (cf. Kulikowski and King-Smith, 1973). Such Gaussian bars allow measurement by contrast masking of our three requisite variables, namely position sensitivity, disparity sensitivity, and spatial-frequency (scale) selectivity (Kontsevich and Tyler, 2004). Gaussians have only one peak, so that the position and disparity of the masking bar can be varied cleanly relative to the peak of the test bar. The lack of side lobes in Gaussian bars makes them particularly suitable for the study of stereoscopic disparity tuning by a masking paradigm because there is no aliasing of the disparity signal by spurious peak coincidences (as there would be with narrowband wavelet stimuli, for example). The use of such Gaussian bars in peripheral vision allows both disparity and position to be varied within a homogeneous retinal region. Gaussian bars also provide a substantial degree of tuning in spatial frequency (Kontsevich and Tyler, 2004, 2005). If the Gaussian is smaller than the receptive-field center, it provides less than optimal activation; if larger, it stimulates the inhibitory surround, tending to reduce the response. Gaussians thus have an optimal size tuning for center-surround receptive fields that translates into an effective peak spatial frequency.
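For reference, a Gaussian bar of this kind is fully specified by its width at half height. The sketch below constructs such a profile; the 25 arcmin width matches the test and mask bars described in the following sections, while the contrast value and the names are illustrative.

```python
import numpy as np

# Luminance profile of a Gaussian bar specified by its full width at half height
# (illustrative sketch; the contrast value and parameter names are arbitrary).
def gaussian_bar(x_arcmin: np.ndarray, width_at_half_height: float = 25.0,
                 contrast: float = 0.5) -> np.ndarray:
    sigma = width_at_half_height / (2 * np.sqrt(2 * np.log(2)))   # FWHM -> sigma
    return contrast * np.exp(-x_arcmin ** 2 / (2 * sigma ** 2))

x = np.linspace(-60, 60, 241)      # position relative to the bar center (arcmin)
test = gaussian_bar(x)             # single-peaked: no side lobes to alias the disparity signal
mask = gaussian_bar(x - 20)        # e.g., a mask displaced 20 arcmin laterally
```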
5.4 Psychophysical procedure
The stimuli were Gaussian blobs to measure local position tuning, disparity tuning, spatial-frequency tuning, and polarity tuning at a location 5◦
to the left of the fovea by the masking-threshold paradigm, which gives a direct measure of the channel tunings underlying the masking behavior. The methods were described in detail in Kontsevich and Tyler (2005). The psychophysical task was to set the test contrast at some multiple (2–4) of threshold and to vary the masking contrast by the Ψ-adaptive staircase procedure (Kontsevich and Tyler, 1999) to the level that brought the test contrast back to its unmasked level of detectability. The advantage of this procedure is that it restricts the number of channels stimulated to those most sensitive to the test stimulus, ensuring that the results are focused on the most basic structure of the disparity-processing mechanisms. To take a maximally model-free approach to the question of local channel interactions through the Keplerian array, the masking-sensitivity paradigm, introduced by Stiles (1939) as the field sensitivity paradigm, was employed. The test stimulus was set at a fixed contrast level above its contrast detection threshold, and the mask contrast required to return the test to threshold was determined as a function of the tuning parameter (position, disparity, or scale). The result was a tuning curve through some aspect of the Keplerian array that had three advantages over threshold elevation curves:
(1) Output nonlinearities of the masking behavior do not affect the shape of the masking function, because the test contrast is always at the same level above the detection threshold.
(2) The effect of the test on neighboring channels and their potential interactions is minimized because the test contrast is near threshold.
(3) If the channel structure is sufficiently discrete that the near-threshold test is invisible to neighboring channels, the mask probes the shape of the inhibitory interactions all the way down its flanks to the maximum extent of masking.
5.5 Position tuning
The first experiment was a control study to determine the position tuning of the local contrast-masking effect with respect to the mask phase. With both test and mask of width 25 arcmin, the mask contrast required to reduce the test back to threshold was maximal, implying that the masking sensitivity was minimal beyond about 30 arcmin separation from the test (Figures 5.2a–c). It is noteworthy that this masking function does not follow the form of the overlap between the test and mask stimuli (which would be a Gaussian of √2 × 25 ≈ 35 arcmin width at half height, as shown by the dashed line in Figure 5.2d). Instead, it has an idiosyncratic form that evidently reflects the tuning of the underlying spatial channel (under the channel invariance assumption laid out in the
Figure 5.2 Contrast masking as a function of lateral shift of the masking bar (M) from the test bar (T), both of which had profiles 25 arcmin wide at half height as shown by the solid line in (d). (a)–(c): Masking sensitivity with both mask and test at zero disparity (icon). Arrows at bottom depict test position. The mean standard errors were only ±0.04–0.09 log units, i.e., about the size of the symbols. (d) The Gaussian profile of the test and the mask (solid line) and the expected test/mask interaction profile (dashed line).
Section 5.1) rather than involuntary eye movements or retinal inhomogeneity, which should be consistent across observers. The first disparity-related condition was used in an experiment to measure the degree of masking as a function of the lateral position of a disparate mask (dashed horizontal line in Figure 5.1). The mask disparity was set at 80 arcmin (40 arcmin displacement in each eye) for NF and 60 arcmin for LK. The masking effects (Figure 5.3) reveal the presence of long-range interactions across disparity, with their own tunings distinct from those found for monocular masking. The data in Figure 5.3 show that there are substantial long-range interactions over disparity that are specific to the binocular locations of the stimuli (rather than to the locations of their monocular components). However, a role for the monocular components cannot be dismissed entirely. In fact, there is masking in the 20–60 arcmin range, where it would be predicted by the monocular-masking hypothesis, and the function tends to exhibit a separate peak from the main binocular masking, implying that disparity masking is a combination of the displaced monocular masking and a purely binocular component.
Figure 5.3 Masking sensitivity (thick line) when the mask is behind the test (80 arcmin disparity for NF and 60 arcmin for LK). The dashed lines depict the disparity masking expected from the combined monocular components. The arrows at bottom depict the test disparity. The mean standard errors over the data set were ±0.04 log units for NF and ±0.04 for LK.
A quantitative, parameter-free prediction of the extent of the monocular effect is given by adding the contrast of the monocular components to half that of the binocular test stimulus. The predicted masking is shown by the dashed lines in Figure 5.3, which provide a fair account of the outer flanks of the masking behavior, but the figure shows an additional, purely binocular component of the disparity interactions, centered on zero disparity, that extends to about ±35 arcmin of positional shift. By this estimate, the purely binocular masking is comparable in frontal extent to the masking from the monocular components.
5.6 Disparity selectivity of contrast masking
The second experiment was to measure the masking sensitivity as a function of the stereoscopic disparity of the mask, which was varied in disparity along the same line of sight as the zero-disparity test stimulus. We know, from Figure 5.3, that there is substantial masking at large disparities, of the order of 1◦ (a half-disparity of ∼ 30 arcmin in each eye). To facilitate comparison with the monocular conditions, the masking sensitivity in this experiment is depicted in Figure 5.4 as a function of “half-disparity,” which can be conceptualized as the mask displacement from the standard test position in each eye. The data form neither a narrow peak like the monocular masking function nor a uniform mesa out to some disparity limit. Instead, the disparity masking exhibits an unexpected “batwing” shape without precedent in the stereoscopic literature. The masking was two or three times as strong at the peak half-disparities of ±40 arcmin for NF, ±30 arcmin for LK, and ±20 arcmin for LM as in the central region near zero disparity.
Figure 5.4 Masking sensitivity as a function of the disparity of a 25 arcmin masking bar at the same mean visual direction (icon) as the 25 arcmin test bar at zero disparity (arrow) for three observers. The solid lines depict the peculiar batwing form of the masking sensitivity, with a minimum near the test disparity and maxima further away from it at crossed and uncrossed disparities. The monocular masking sensitivity for the combined monocular positions implied by each mask disparity is plotted as a dashed line for comparison. The arrows at bottom depict the test disparity. The mean standard errors were ±0.04–0.09 log units.
For this experiment, the role of monocular masking is more complex to evaluate than for Figure 5.3, since it would require convolution with some assumed function of sensitivity to disparity. However, it seems worth noting that the masking in the central region (±20 arcmin) matches the shape that would be expected if this region were dominated by the monocular masking effect from Figures 5.2a–c (shown as thin lines in those figures). On this interpretation, the disparity-specific masking behavior is essentially restricted to two humps extending from about 15 to a maximum of 45–65 arcmin in half-disparity in both the near and the far directions. The width of the masking peaks is similar to that of the stimuli themselves, implying that the batwing form of the disparity masking could be attributable essentially to just two channels situated at the peak masking disparities. Convolving these locations with the stimulus profile, and adding the unavoidable
monocular component for the zero-disparity mask would explain much of the masking behavior. Such an interpretation implies astonishing specificity in the cross-disparity interactions. To provide a clearer representation of the significance of the masking effects reported so far, Figure 5.5 plots the available data for two observers in 3D fashion on the plane of the horizontal Keplerian array depicted in Figure 5.1. Here the vertical dimension (thick curves) represents the masking sensitivity for a mechanism responding to a test stimulus at zero disparity and centered at the 5° peripheral location. The measured curves from Figures 5.2–5.4, together with additional slices, combine to define a biconical ("diaboloid") form that expands along the monocular visual-direction lines as mask disparity increases. The outline (gray cross-section) is interpolated through the points where the masking-sensitivity curve meets the baseline and implicitly follows the masking-sensitivity functions elsewhere. The increase in masking strength with disparity away from the horopter may imply a transition from a fine to a coarse disparity system, with the inhibition implied by the masking behavior operating in the coarse-to-fine direction.

Figure 5.5 Masking sensitivities replotted from Figures 5.2–5.4, for the 25 arcmin test bar at zero disparity as a function of the location of the 25 arcmin masking bar, projected onto orthographic views of the Keplerian array of Figure 5.1. The thick outline depicts the diaboloid form of the masking-sensitivity limit in this disparity space.

5.7 Size specificity of disparity masking
One may ask whether the range of disparity masking is (1) a unitary function or (2) specific for the different spatial-frequency channels. Access to a variety of spatial-frequency channels was achieved by varying the width of the Gaussian stimuli constituting the test and mask stimuli. As described in Section 5.3, the Gaussian test profile is a remarkably selective probe for particular spatial-frequency-selective mechanisms, below the peak of the sensitivity function (Kulikowski and King-Smith, 1973; Kontsevich and Tyler, 2004). The widths of the Gaussian test and mask stimuli were varied in tandem in one-octave steps from 50 to 3 arcmin (corresponding to peak spatial frequencies of 0.49 to 7.8 cycles per degree). The masking-sensitivity functions exhibited two peaks at all spatial frequencies (Figure 5.6), but the peaks did not remain at the same disparities. Instead, the peaks shifted in rough correspondence with the change in stimulus width, peaking at approximately one stimulus width away from zero disparity all the way from the 50 arcmin to the 6 arcmin stimuli. (Note that, as mentioned, all effects discussed are at least a factor of two, and are highly statistically significant relative to the errors, which are of the order of 0.06 log units, or about 15%.) The narrowest (3 arcmin) stimulus failed to narrow the disparity range further, as would be expected when the Gaussian stimuli pass the peak of the contrast tuning function (Kontsevich and Tyler, 2004). These data indicate that the size selectivity of disparity masking has at least a tenfold range, from a width of ∼1° down to ∼6 arcmin of disparity at half height.

On the other hand, the disparity range of disparity masking turns out to be asymmetric with spatial frequency over the 16-fold range of the measurements. If we focus attention on the upper limit of the masking range (the outer skirts of the functions in Figure 5.6), this limit varies by only about a factor of two (from about −80 to −50 arcmin) in the negative, or near, direction, compared with a factor of nearly five (from about +70 down to +15 arcmin) in the positive, or far, direction. There is a corresponding peak asymmetry for this observer that is essentially replicated in all five measured functions. The near-disparity masks produce shallower flanks of masking than do the far disparities, especially for the narrower stimuli. Thus, the near/far disparity asymmetry is consistent across all measured Gaussian-stimulus widths. Note that this data set validates the batwing shape of the disparity-masking function, which becomes even more salient for the narrower stimuli.

5.8 Relationship of masking to test disparity: absolute or relative?
Having identified pronounced disparity selectivity in the effects of masking on a test at zero disparity, we may ask how the masking structure varies with the disparity of the test target. Two hypotheses arise.
(1) The masking is a unitary structure that is unaffected by the location of the test probe. The masking should then remain at the same absolute disparity range regardless of the test disparity.
(2) The masking structure is specific to the disparity of the test probe. The masking function should then shift with disparity to remain locked to the range defined by the probe disparity.

Figure 5.6 Masking sensitivity as a function of masking disparity for the test bar at zero disparity, measured for a full range of bar widths as indicated (matched for test and mask). The thick lines on the left depict the consistent batwing form of the masking sensitivities, with a minimum near the test disparity. The thin lines in the right panels show the monocular profiles of each stimulus, to scale. The arrows depict the test disparity. From top to bottom, the mean standard errors were ±0.06, ±0.04, ±0.12, ±0.08, and ±0.05 decimal log units.
Disparity-masking functions were measured for test disparities set from −80 to +80 arcmin in 40 arcmin increments. For each test disparity, the mask disparity was varied to generate a masking function similar to those in previous graphs. The masking functions conformed to neither prediction alone, but showed aspects of both kinds of hypothesized behavior (Figure 5.7). The outer limits of the masking did not seem to track the disparity of the test, but they were essentially symmetrical for all test disparities. This stability implies that there is a generic component of the masking that is relatively invariant with test disparity, providing a broad bluff of masking over the full range of visible disparities (see Figure 5.7 again). In fact, there was a slight tendency for the
masking range to widen in the disparity range opposite to the test disparity, as though the presence of the test had a reciprocal inhibition on the range of the generic component. However, there is also a local disparity-specific masking effect that appears to be overlaid on this generic masking range. This takes the form of a double peak of enhanced masking at disparities of about 15 arcmin on either side of the test disparity. There also seems to be a tendency for the masking at the test disparity to be lower than that at the extremes of the generic component, just before it starts to descend to zero masking. Taken together, these effects imply that the disparity-specific masking takes the form of a wavelet of central disinhibition flanked by peaks of enhanced masking.

Figure 5.7 Contrast-masking sensitivities as a function of the disparity of the masking bar for the 25 arcmin test bar at a range of different test disparities (arrows), each showing a minimum near the test disparity indicated by an arrow (together with other irregularities). The arrows at the bottom depict the test disparity, which varied from −80 to +80 arcmin in 40 arcmin steps, with mean standard errors of ±0.01 to ±0.06 decimal log units (i.e., less than the size of the symbols). The inset at the right depicts the output of a conceptual model of the dual-process masking mechanism.
5.9 Computational model
The behavior described up to this point may be captured in a computational model based on the following equations. The Keplerian space of Figure 5.1 may be expressed as a composite of position (x) and disparity (z) dimensions. When scaled in appropriate units, the left-eye parallels are given by l = x + z, and the right-eye parallels by r = z − x. The monocular masking behavior is modeled by a Gaussian along the l and r directions:

M_m = e^{-(l/\sigma_m)^2} + e^{-(r/\sigma_m)^2}.    (5.1)
The generic component of the disparity-domain masking plateau M_z is given by a variant of the Gaussian with a higher power:

M_z = e^{-(z/\sigma_z)^8}.    (5.2)
The disparity-specific component M_d is a function keyed to the disparity relative to that of the test target (symbolized by \Delta z), rather than to the convergence angle (the absolute disparity, z). In these terms, the disparity-specific component is a wavelet given by a difference of Gaussians keyed to the relative disparity, scaled with the stimulus width \sigma_s:

M_d = e^{-(2\Delta z/\sigma_s)^2} - e^{-(\Delta z/\sigma_s)^2}.    (5.3)
This idea of a combined generic masking plateau Mz and a disparity-specific wavelet Md is depicted in the right panel of Figure 5.7, where the wavelet of inhibition is superimposed on a stable base function and travels with the disparity of the test stimulus as implied by the arrow. The data support the idea of two peaks and a dip between them (evident at all spatial frequencies in the data
of Figure 5.6, and all test disparities in Figure 5.7). This traveling dip represents a minimum in the masking function, in that there is always less masking when the mask is at the test disparity than when it is at adjacent disparities. The dip therefore implies a component of masking due to some facilitatory influence between mechanisms at neighboring disparities, rather than to local masking by two targets at the same disparity. The conceptual structure of the full scope of disparity interactions is derived as a weighted sum of the three masking effects described by Eqs. (5.1)–(5.3):

M = a_m M_m + a_z M_z + a_d M_d.    (5.4)
Thus, the full model consists of two local monocular masking effects, a broad generic disparity-masking effect, and a disparity-specific facilitatory wavelet effect. Equation (5.4) leaves open the issue of whether and how the scaling constants a_{m,z,d} depend on the contrast and polarity of the targets. The three components and the resultant masking pattern are illustrated in Figure 5.8 for the case of a same-polarity mask varying in disparity around the test location. The panels depict a Keplerian array in the format of Figure 5.1, with the visual axes for the two eyes schematized as lying along the two diagonals of the panels. The monocular masking effect takes place near the monocular retinal locations of the binocular mask and consequently produces a cross-like inhibition pattern in the Keplerian array along the monocular lines of projection for the two eyes (Figure 5.8, first panel). The wavelet component produces two inhibitory areas, in front of and behind the mask (second panel). The generic binocular component spreads broadly across both depth and lateral positions (third panel). These three components combined
produce an inhibitory pattern of disparity interactions (fourth panel) similar to the empirical behavior shown in Figure 5.5. In addition to these two structural regularities, there seem to be some idiosyncrasies in the masking behavior for the uncrossed (positive) test disparities in Figure 5.7. At a test disparity of 10 arcmin, an additional minimum appears at −10 arcmin mask disparity. This feature is confirmed more weakly in the −20 arcmin curve, but a symmetric dip does not appear in the cases of crossed (negative) test disparity.

Figure 5.8 Masking components in the Keplerian-array coordinates of Figure 5.1 (lateral position on the horizontal axis and near/far disparity on the vertical axis). The summed monocular, binocular polarity-specific, and generic binocular components of the masking are depicted separately in the respective panels. They combine to produce the signature "diaboloid" shape of the masking in the "combined" panel, which matches the empirical masking pattern (see Figure 5.5).
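For readers who want to experiment with this account, the three components of Eqs. (5.1)–(5.4) are easy to evaluate numerically. The short Python sketch below is our own illustration rather than the fitting code used for the data: the widths (sigma_m, sigma_z, sigma_s) and the weights (a_m, a_z, a_d) are placeholder values chosen only to reproduce the qualitative layout of Figure 5.8.

import numpy as np

def masking_field(x, z, z_test=0.0,
                  sigma_m=15.0, sigma_z=40.0, sigma_s=25.0,
                  a_m=1.0, a_z=1.0, a_d=1.0):
    """Evaluate the three masking components of Eqs. (5.1)-(5.4) on a
    Keplerian grid.  x and z are lateral position and disparity in
    arcmin; z_test is the test disparity.  The widths and weights are
    placeholder values, not fitted parameters."""
    # Monocular coordinates: left- and right-eye parallels of the array.
    l = x + z
    r = z - x
    # Eq. (5.1): monocular masking along each eye's line of sight.
    M_m = np.exp(-(l / sigma_m) ** 2) + np.exp(-(r / sigma_m) ** 2)
    # Eq. (5.2): generic binocular plateau (Gaussian raised to the 8th power).
    M_z = np.exp(-(z / sigma_z) ** 8)
    # Eq. (5.3): disparity-specific wavelet keyed to the test disparity.
    dz = z - z_test
    M_d = np.exp(-(2 * dz / sigma_s) ** 2) - np.exp(-(dz / sigma_s) ** 2)
    # Eq. (5.4): weighted sum of the three components.
    return a_m * M_m + a_z * M_z + a_d * M_d

# Evaluate over a +/-80 arcmin Keplerian array, test at zero disparity.
x, z = np.meshgrid(np.linspace(-80, 80, 161), np.linspace(-80, 80, 161))
M = masking_field(x, z)

Plotting M over this grid gives the cross, plateau, and wavelet layout sketched in Figure 5.8; how the weights depend on contrast and polarity is left open here, just as it is in Eq. (5.4).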
5.10 Polarity specificity of disparity masking
The scaling parameters in Eq. (5.4) may be regarded as abstract scalars, but there is an alternative option: to treat them as predictions derived from the (signed) Weber contrast, which would make the predicted function depend on the sign of the luminance contrast polarity. To test one aspect of this prediction, the contrast polarity of the mask target was inverted to be negative (dark) Gaussians in both eyes relative to the gray background, while the test target was retained as positive. The prediction is that any component of the masking function that is polarity-specific should invert. Conversely, if the masking is determined by the contrast energy at a particular location in space, no change should be expected in the masking function. The data clearly demonstrate that the use of a dark instead of a light mask does radically alter the masking function, although it does not generate facilitatory behavior in test detectability, as shown in Figure 5.9a. To a first approximation, the batwing form of the test-disparity-selective component seems to be inverted, as depicted in the conceptual model of Figure 5.9b, in which the inversion matches the qualitative features of the dark-bar masking, whose maximum occurs near the (zero) test disparity. The regions of maximum same-sign masking, at about ±40 arcmin disparity, now exhibit local minima on either side of the peak in the opposite-sign masking effect, sharing many of the features of the conceptual model of a disparity-specific wavelet added to a polarity-invariant masking plateau (as in the inset to Figure 5.7). However, the plateau component of the masking model in Figure 5.9b is unchanged, implying that both the monocular and the generic components are proportional to the absolute rather than the signed Weber contrast (a_m ≈ |c|, a_z ≈ |c|). This initial assay reveals that a polarity-specific masking mechanism exists and may be employed in evaluating the multicomponent hypothesis of disparity masking. For a complete understanding of the contrast relationships, however, one would have to measure the same-sign masking for both light and dark stimuli, and for
the opposite pair of opposite-sign stimuli, and do so as a function of both spatial frequency and contrast.

Figure 5.9 (a) Comparison of masking by a dark 25 arcmin bar (full curve) with that by a light 25 arcmin bar (dashed curve, reproduced from Figure 5.4 for test and mask with positive polarity). Note the inversion of the two main peaks and the central trough around a moderate masking level to generate two troughs flanking the central peak. The arrows at the bottom depict the test disparity. The mean standard error was ±0.03 log units over the three to five measurements per point. (b) Conceptual model showing the effect of inverting the polarity of the disparity-specific wavelet added to a polarity-invariant masking plateau.

5.11 The nature of disparity masking
The original motivation for this masking study, as implied in the introduction (Section 5.1), was as a technique for evaluating the local channel structure of stereoscopic processing along the lines of the paradigm of Stromeyer and Julesz (1972) for luminance contrast or Stevenson et al. (1992) for dynamic noise disparity planes. However, masking in general is composed of at least two distinct processes: self-masking within a channel as the presence of the mask activates the channel and degrades its ability to respond to an additional stimulus, and inhibitory responses between channels as activation of one channel reduces the response in a neighboring channel by reciprocal inhibition. In spatial vision, the masking paradigm has generally revealed a simple structure that is readily interpretable in terms of self-masking within channels. The lateral masking effects in Figure 5.2a conform to the self-masking model, for example. At 5◦ eccentricity, the masking range extends about 1◦ around the lines of sight of the two eyes and 1–3◦ in disparity, depending on the size of the test stimuli. It is evident from the model structure shown in Figure 5.8, however, that there is a large degree of disparity-specific masking that cannot be explained by the masking of its monocular constituents. The pattern of the results for
local masks in the disparity domain reveals a complex structure of inhibitory interrelations among channels at different disparities. In particular, some of the inhibitory pathways, such as the generic masking plateau, are polarity-independent, while others, such as the disparity-specific disinhibitory wavelet, are polarity-specific. The masking is, in turn, specific to the position, disparity, size (spatial frequency), and contrast polarity of the mask. The resultant structure of disparity interactions is captured qualitatively by a computational model of these three masking components. Both computational and neurophysiological analyses will benefit from building such masking behavior into future models of disparity encoding of depth information.
5.12 Relation to the 3D environment
In making fixations on objects in the 3D environment, it is evident that the binocular system will be faced by a large array of features at various spatial positions relative to the feature being fixated. The present results imply that any objects within the sector between the lines of sight of an object will degrade the ability to resolve its disparity. For solid objects, this issue does not usually arise, since they will be likely to occlude each other, so the relevant situation is when the objects are partially transparent, such as when one is looking in a glass-fronted shop window or looking through the surface of a lake. The present results are particularly relevant to situations where information is presented in this transparent form, such as in the "head-up" displays being developed for aircraft cockpits. The data imply that there are extended 3D masking effects in this situation, which are much stronger than those predicted from the local monocular masking effects. These interactions imply that, in complex 3D environments, we have to navigate a maze of 3D spatial interactions in the stereoscopic visual representation in order to make sense of the visual world.

Acknowledgments

This work was supported by AFOSR grant number FA9550-09-1-0678.

References

Chen, C. C. and Tyler, C. W. (1999). Spatial pattern summation is phase-insensitive in the fovea but not in the periphery. Spat. Vis., 12: 267–285.
Chen, C. C. and Tyler, C. W. (2001). Lateral sensitivity modulation explains the flanker effect in contrast discrimination. Proc. R. Soc. Biol. Sci., 268: 509–516.
Chen, C. C. and Tyler, C. W. (2006). Evidence for elongated receptive field structure for mechanisms subserving stereopsis. Vis. Res., 46: 2691–2702.
Chen, C. C. and Tyler, C. W. (2008). Excitatory and inhibitory interaction fields of flankers revealed by contrast-masking functions. J. Vis., 8: 10.1–10.14.
Field, D. J., Hayes, A., and Hess, R. F. (1993). Contour integration by the human visual system: evidence for a local "association field". Vis. Res., 33: 173–193.
Foley, J. M. (1994). Human luminance pattern-vision mechanisms: masking experiments require a new model. J. Opt. Soc. Am. A, 11: 1710–1719.
Hess, R. F., Hayes, A., and Kingdom, F. A. (1997). Integrating contours within and through depth. Vis. Res., 37: 691–696.
Hess, R. F., Hayes, A., and Field, D. J. (2003). Contour integration and cortical processing. J. Physiol. (Paris), 97: 105–119.
Julesz, B. (1971). Foundations of Cyclopean Perception. Chicago, IL: University of Chicago Press.
Julesz, B. (1978). Global Stereopsis: Cooperative Phenomena in Stereoscopic Depth Perception. Berlin: Springer.
Kepler, J. (1611). Dioptrice. Augsburg: Vindelicorum.
Kontsevich, L. L. and Tyler, C. W. (1999). Bayesian adaptive estimation of psychometric slope and threshold. Vis. Res., 39: 2729–2737.
Kontsevich, L. L. and Tyler, C. W. (2004). Local channel structure of sustained peripheral vision. In B. E. Rogowitz and T. N. Pappas (eds.), Human Vision and Electronic Imaging IX, Proc. SPIE, 5292, pp. 26–33. Bellingham, WA: SPIE.
Kontsevich, L. L. and Tyler, C. W. (2005). The structure of stereoscopic masking: position, disparity, and size tuning. Vis. Res., 43: 3096–3108.
Kulikowski, J. J. and King-Smith, P. E. (1973). Spatial arrangement of line, edge and grating detectors revealed by subthreshold summation. Vis. Res., 13: 1455–1478.
Levi, D. M., Klein, S. A., and Hariharan, S. (2002). Suppressive and facilitatory spatial interactions in foveal vision: foveal crowding is simple contrast masking. J. Vis., 2: 140–166.
Mansfield, J. S. and Parker, A. J. (1993). An orientation-tuned component in the contrast masking of stereopsis. Vis. Res., 33: 1535–1544.
Marr, D. (1982). Vision. San Francisco, CA: W. H. Freeman.
Ogle, K. N. (1950). Researches in Binocular Vision. Philadelphia, PA: Saunders.
Polat, U. and Sagi, D. (1993). Lateral interactions between spatial channels: suppression and facilitation revealed by lateral masking experiments. Vis. Res., 33: 993–999.
Polat, U. and Tyler, C. W. (1999). What pattern the eye sees best. Vis. Res., 39: 887–895.
Stevenson, S. B., Cormack, L. K., Schor, C. M., and Tyler, C. W. (1992). Disparity tuning in mechanisms of human stereopsis. Vis. Res., 32: 1685–1694.
Stiles, W. S. (1939). The directional sensitivity of the retina and the spectral sensitivities of the rods and cones. Proc. R. Soc. (Lond.) B, 127: 64–105.
Stromeyer, C. F. and Julesz, B. (1972). Spatial-frequency masking in vision: critical bands and spread of masking. J. Opt. Soc. Am., 62: 1221–1232.
Tyler, C. W. (1973). Stereoscopic vision: cortical limitations and a disparity scaling effect. Science, 181: 276–278.
Tyler, C. W. (1975). Characteristics of stereomovement suppression. Percept. Psychophys., 17: 225–230.
Tyler, C. W. (1983). Sensory processing of binocular disparity. In C. Schor and K. J. Ciuffreda (eds.), Basic and Clinical Aspects of Binocular Vergence Eye Movements, pp. 199–295. London: Butterworth.
Tyler, C. W. (1991). Cyclopean vision. In D. Regan (ed.), Binocular Vision. Vol. 9 of Vision and Visual Disorders, pp. 38–74. New York: Macmillan.
Tyler, C. W. (2005). Spatial form as inherently three-dimensional. In M. Jenkin and L. R. Harris (eds.), Seeing Spatial Form. Oxford: Oxford University Press.
6 Blur and perceived depth

Martin S. Banks and Robert T. Held

6.1 Introduction
Estimating the three-dimensional (3D) structure of the environment is challenging because the third dimension – depth – is not directly available in the retinal images. This "inverse optics problem" has for years been a core area in vision science (Berkeley, 1709; Helmholtz, 1867). The traditional approach to studying depth perception defines "cues" – identifiable sources of depth information – that could in principle provide useful information. This approach can be summarized with a depth-cue taxonomy, a categorization of potential cues and the sort of depth information they provide (Palmer, 1999). The categorization is usually based on a geometric analysis of the relationship between scene properties and the retinal images they produce. The relationship between the values of depth cues and the 3D structure of the viewed scene is always uncertain, and the uncertainty has two general causes. The first is noise in the measurements by the visual system. For example, the estimation of depth from disparity is uncertain because of internal errors in measuring retinal disparity (Cormack et al., 1997) and eye position (Backus et al., 1999). The second cause is the uncertain relationship between the properties of the external environment and retinal images. For example, the estimation of depth from aerial perspective is uncertain because various external properties – for example, the current properties of the atmosphere and the illumination and reflectance properties of the object – affect the contrast, saturation, and hue of the retinal image (Fry et al., 1949). It is unclear how to incorporate those uncertainties into the classical geometric model of depth perception.
This classical approach is being replaced with one based on statistical reasoning: Bayesian inference. This approach gracefully incorporates both uncertainty due to internal noise and uncertainty due to external properties that are unknown to the observer. Here we use the Bayesian framework to explore the information content of an underappreciated source of depth information: the pattern of blur across the retinal image. The work is described in greater detail in Held et al. (2010). We will show that blur does not directly signal depth, so from that cue alone, the viewer could not rationally determine the 3D structure of the stimulus. However, when blur is presented along with other information that is normally present in natural viewing of real scenes, its potential usefulness becomes apparent.

6.2 Background
Our subjective impression of the visual world is that everything is in focus. This impression is reinforced by the common practice in photography and cinematography of creating images that are in focus everywhere (i.e., images with infinite depth of focus). Our subjective impression, however, is quite incorrect because, with the exception of the image on the fovea, most parts of the retinal image are significantly blurred. Here we explore how blur affects depth perception.

Previous research on blur as a depth cue has produced mixed results. For instance, Mather and Smith (2000) found that blur had little effect on perceived depth when it was presented with binocular disparity. These authors manipulated blur independently of disparity and measured the perceived distance separation between two regions. The most interesting condition was one in which a central square was presented with crossed (near) disparity relative to the background. In one case, the square and background were sharp, and in another case, the square was sharp and the background blurred by an amount consistent with the disparity-specified depth interval between the square and the background. The latter condition is illustrated in Figure 6.1. Blurring the background had no effect on the perceived depth interval. Indeed, Mather and Smith observed no effect of blur unless it was greatly exaggerated and therefore specified a much larger separation than the disparity did. Watt et al. (2005) found no effect of blur on the perceived slant of disparity-specified planes (although they did find an effect with perspective-defined planes viewed monocularly). Other investigators have stated that blur has no discernible effect on depth percepts when reliable cues such as structure from motion are available (e.g., Caudek and Proffitt, 1993; Hogervorst and Eagle, 1998, 2000).
Figure 6.1 Stereogram with disparity and blur. The stimulus consists of a central square and a background. Here the square contains a sharp texture and the background a blurred texture. When the images are cross-fused, disparity specifies that the square is in front of the background. (Adapted from Mather and Smith, 2000).
There are, however, convincing demonstrations that blur can provide distance-ordering information. Consider two adjoining regions, one with a blurred texture and one with a sharp texture (Figure 6.2). When the border between the regions is sharp (Figure 6.2, left), it seems to belong to the sharply textured region, and people tend to see that region as nearer and as occluding a blurred textured background. When the border is blurred (Figure 6.2, right), it appears to belong to the region with a blurred texture, and people perceive that region as nearer and as occluding a sharply textured background (Marshall et al., 1996; Mather, 1996, 1997; O’Shea et al., 1997; Mather and Smith, 2002; Palmer and Brooks, 2008). From these results and others, vision scientists have concluded that blur is at best a weak depth cue. Mather and Smith (2002), for example, stated that blur acts as “a relatively coarse, qualitative depth cue” (p. 1211). Despite the widely held view that blur is a weak depth cue, there is also evidence that it can affect the perception of metric distance and size. For example, cinematographers make miniature models appear much larger by reducing the camera aperture and thereby reducing the blur variation throughout the scene (Fielding, 1985). The opposite effect is created by the striking photographic manipulation known as the tilt–shift effect: a full-size scene is made to look much smaller by adding blur with either a special camera or
post-processing software (Laforet, 2007; Flickr, 2009; Held et al., 2010). In photography, the enlarging associated with reduced blur and the miniaturization associated with added blur are called depth-of-field effects (Kingslake, 1992). The images in Figure 6.3 demonstrate miniaturization. The left image is a photograph of an urban scene taken with a small aperture. The right image has been rendered with a blur pattern consistent with a shorter focal distance and therefore with a smaller scene; the added blur makes it look like a scale model. This and related demonstrations show that blur can have a profound effect on perceived size, which is presumably caused by an effect on perceived absolute distance.

In summary, the literature presents a mixed picture of whether blur is a useful cue to depth. To resolve the conflict, we turn to blur's optical origins and examine how it could in principle be used to estimate distance.

Figure 6.2 Demonstration of border blur and depth interpretation. There are two textured regions in each panel, one sharp and one blurred. Left: the border is sharp, so it appears to "belong" to the region on the left with sharp texture. Most people see the sharp region as the occluder and the blurred region as part of a partially occluded background. Right: the border is blurred, so it appears to belong to the region on the right with blurred texture. Most now see the blurred region as the occluder. (Adapted from Marshall et al., 1996.)

Figure 6.3 Blur and perceived size and distance. The left image was rendered with a pinhole aperture. There is little blur, and the apparent sizes of the buildings are large. The right image was rendered with a virtual 60 m aperture. There is considerable blur, and the apparent sizes of the buildings are noticeably smaller than in the left image. (The original city images and data from Google Earth are copyright TerraMetrics, Sanborn, and Google.) See http://www.tiltshiftphotography.net/ for other images. A color version of this figure can be found on the publisher's website (www.cambridge.org/9781107001756).

6.3 Probabilistic modeling of blur as a distance cue
In an ideal lens with a focal length f, the light rays emanating from a point at some near distance z in front of the lens will be focused to another point on the opposite side of the lens at a distance s, where the relationship between these distances is given by the thin-lens equation,

\frac{1}{s} + \frac{1}{z} = \frac{1}{f}.    (6.1)
If the image plane is at a distance s0 behind the lens, then light emanating from features at a distance z0 = 1/((1/f) − (1/s0)) will be focused on the image plane (Figure 6.4). The plane at a distance z0 is the focal plane, so z0 is the focal distance of the imaging device. Objects at other distances will be out of focus and hence will generate blurred images in the image plane. We can express the amount of blur by the diameter b of the blur circle. For an object at a distance z1,

b = \frac{A s_0}{z_0} \left| 1 - \frac{z_0}{z_1} \right|,

where A is the diameter of the aperture. It is convenient to substitute r for the distance ratio z1/z0, yielding

b = \frac{A s_0}{z_0} \left| 1 - \frac{1}{r} \right|.    (6.2)
Real imaging devices such as the eye have imperfect optics and more than one refracting element, so Eq. (6.2) is not strictly correct. Later, we shall describe those effects and show that they do not affect our analysis.
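The geometry of Eq. (6.2), and the inversion that generates the curves of Figure 6.5, can be sketched in a few lines of Python. This is purely illustrative: the values for the pupil diameter A and the lens-to-retina distance s0 are the ones assumed later in the chapter (4.6 mm and 17 mm), and the function names are ours.

import numpy as np

A = 4.6e-3     # assumed pupil diameter (m)
s0 = 17e-3     # assumed lens-to-retina distance (m)

def blur_diameter(z0, r, A=A, s0=s0):
    """Retinal blur-circle diameter (m) from Eq. (6.2):
    z0 = focal distance, r = z1/z0 (object distance / focal distance).
    Dividing the result by s0 gives the blur in radians of visual angle
    (small-angle approximation)."""
    return (A * s0 / z0) * abs(1.0 - 1.0 / r)

def focal_distance(b, r, A=A, s0=s0):
    """Invert Eq. (6.2): the focal distance consistent with a measured
    blur diameter b (m) at an assumed distance ratio r.  For a fixed b,
    every r gives a different answer; these are the curves of Figure 6.5."""
    return (A * s0 / b) * abs(1.0 - 1.0 / r)

# One observed blur is consistent with many (z0, r) pairs:
b = blur_diameter(z0=0.5, r=2.0)        # object twice as far as the focal plane
for r in (0.5, 0.8, 2.0, 5.0):
    print(r, focal_distance(b, r))      # very different focal distances

Running the final loop makes the ambiguity concrete: the same blur diameter maps onto very different focal distances depending on the assumed distance ratio, which is why blur alone cannot specify absolute distance.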
Figure 6.4 The optics and formation of blurred images. The box represents a camera with aperture diameter A. The camera lens forms an image in the film plane at the back of the box. Two objects, at distances z0 and z1, are presented with the camera focused at z0, so z1 creates a blurred image of width b.
If an object is blurred, is it possible to recover its distance from the viewer? To answer this, we examine the implications of Eq. (6.2). Now the aperture A is the diameter of a human pupil and z0 is the distance to which the eye is focused. Figure 6.5 shows the probability of z0 and r for a given amount of blur, assuming A is 4.6 mm (±1 mm; Spring and Stiles, 1948). For each blur magnitude, an infinite number of combinations of z0 and r are possible. The distributions for large and small blur differ: large blur diameters are consistent with a range of short focal distances, and small blur diameters are consistent with a range of long distances. Nonetheless, one cannot estimate focal distance or the distance ratio from one observation of blur or from a set of such observations. How, then, does the change in perceived size and distance in Figure 6.3 occur? The images in Figure 6.3 contain other pictorial cues that specify the distance ratios between objects in the scene. Such cues are scale-ambiguous, with the possible exception of familiar size, so they cannot directly signal the absolute distances to objects. Here we will focus on perspective cues (e.g., linear perspective, texture gradient, and relative size), which are commonplace in natural scenes. We can determine absolute distance from the combination of
blur and perspective. To do this, we employ Bayes' Rule. Assuming conditional independence,

P(z_0, r \mid b, p) = \frac{P(b \mid z_0, r) P(p \mid z_0, r) P(z_0, r)}{P(b, p)},    (6.3)

where p is the perspective information, b is the blur-circle diameter, r is the distance ratio, and z0 is the absolute distance to which the eye is focused. By regrouping the terms in a fashion described by Burge et al. (2010), we obtain

P(z_0, r \mid b, p) = \frac{1}{P(b, p)/P(p)} P(b \mid z_0, r) P(z_0, r \mid p).    (6.4)

It would be more useful to have the distributions expressed in terms of common units, such that the posterior can be obtained via pointwise multiplication. Here, we would like to have both terms on the right expressed in terms of the absolute distance and the distance ratio, z0 and r, respectively. Fortunately, there is a deterministic relationship between the focal distance, the distance ratio, and the retinal blur due to defocus. This relationship, given by Eq. (6.2), can be inverted to map the blur back into the focal distance and the distance ratio: z_0, r = f^{-1}(b). The distributions in Figure 6.5 show the possible values for z0 and r, given a retinal blur b. Now we can express the two probability distributions in Eq. (6.4) in terms of the focal distance and the distance ratio (because f^{-1} maps blur into z0 and r):

P(z_0, r \mid b, p) = \frac{1}{P(b, p)/P(p)} P(f^{-1}(b) \mid z_0, r) P(z_0, r \mid p).    (6.5)

Figure 6.5 Focal distance as a function of retinal-image blur and relative distance. The relative distance is the ratio of the distance to an object and the distance to the focal plane. The three curves represent different amounts of image blur, expressed as the diameter of the blur circle, b. The variance in the distribution was determined by assuming that the pupil diameter was Gaussian-distributed with a mean of 4.6 mm and a standard deviation of 1 mm (Spring and Stiles, 1948). We assume that the observer has no knowledge of the current pupil diameter. For a given amount of blur, it is impossible to recover the original focal distance without knowledge of the relative distance. Note that as the relative distance approaches 1, the object moves closer to the focal plane. There is a singularity at a relative distance of 1 because the object is, by definition, completely in focus. A color version of this figure can be found on the publisher's website (www.cambridge.org/9781107001756).
This produces the right panel in Figure 6.6. The left panel is the distribution of absolute distances and distance ratios given the observed blur. The middle panel is the distribution of distances given the observed perspective. And the right panel is the combined estimate based on both cues. Notice that the absolute distance and the distance ratio are now reasonably well specified by the image data. We can use the distribution in Figure 6.6c to make depth estimates.
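In discrete form, the estimator of Eq. (6.5) is just an element-wise product of two maps defined over a grid of focal distances and distance ratios. The following Python sketch is a toy version under assumed likelihood widths; the noise parameters sigma_b and sigma_r are invented for illustration and are not the values behind Figure 6.6.

import numpy as np

# Log-spaced grid over focal distance z0 (m) and distance ratio r.
z0 = np.logspace(-2, 2, 200)            # 0.01 m .. 100 m
r = np.logspace(-1, 1, 200)             # 0.1 .. 10
Z0, R = np.meshgrid(z0, r, indexing="ij")

A, s0 = 4.6e-3, 17e-3                   # assumed pupil and eye parameters
b_obs = np.deg2rad(0.1) * s0            # observed blur: 0.1 deg of visual angle
sigma_b = 0.3 * b_obs                   # assumed measurement noise on blur
r_persp, sigma_r = 2.0, 0.1             # perspective says r is about 2 (log units)

# P(b | z0, r): how likely is the observed blur at each (z0, r)?
b_pred = (A * s0 / Z0) * np.abs(1.0 - 1.0 / R)
like_blur = np.exp(-0.5 * ((b_pred - b_obs) / sigma_b) ** 2)

# P(z0, r | p): perspective constrains the ratio r but not z0.
like_persp = np.exp(-0.5 * ((np.log(R) - np.log(r_persp)) / sigma_r) ** 2)

# Eq. (6.5), up to normalization: point-wise product of the two maps.
posterior = like_blur * like_persp
posterior /= posterior.sum()

# Read off an estimate of absolute distance, e.g. the posterior mode.
i, j = np.unravel_index(np.argmax(posterior), posterior.shape)
print("estimated focal distance:", z0[i], "m; distance ratio:", r[j])

With perspective pinning down the distance ratio and blur pinning down the scale, the posterior collapses to a small region, as in Figure 6.6c.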
Figure 6.6 Bayesian analysis of blur as an absolute cue to distance. (a) The probability distribution P(z0, r|b), where b is the observed image blur diameter (in this case, 0.1°), z0 is the focal distance, and r is the distance ratio (z1/z0, object/focal). One cannot estimate the absolute or the relative distance to points in the scene from observations of their blur. (b) The probability distribution P(z0, r|p), where p is the observed perspective. The perspective specifies the relative but not the absolute distance: it is scale-ambiguous. (c) The product of the distributions in (a) and (b). From this posterior distribution, the absolute and relative distances of points in the scene can be estimated.
Figure 6.6 illustrates the Bayesian model for estimating distance from blur. We have made some simplifying assumptions, all of which can be avoided as we learn more. First, we have assumed that the visual system's capacity to estimate depth from blur is limited only by the optics of retinal-image formation. Of course, the visual system must measure the blur from the image, and this measurement is subject to error (Walsh and Charman, 1988; Mather and Smith, 2002; Georgeson et al., 2007). If we were to include this observation, the blur distributions in Figures 6.5 and 6.6a would have larger variance than those shown. This would decrease the precision of estimation, but should not affect the accuracy. The measurement of blur has been investigated in the literature on computer and biological vision (Pentland, 1987; Farid and Simoncelli, 1998; Mather and Smith, 2000; Georgeson et al., 2007). Naturally, the interpretation of a blur measurement depends on the content of the scene, because the same blur can be observed at the retina when a sharp edge in the world is viewed with inaccurate focus and when a blurred edge (e.g., a shadow border) is viewed with accurate focus.

Second, we have assumed a representative variance for the measurement of perspective, but this would certainly be expected to vary significantly across different types of images. The measurement of perspective has also been extensively investigated in the literature on computer and biological vision (Brillault-O'Mahony, 1991; Knill, 1998; Coughlan and Yuille, 2003; Okatani and Deguchi, 2007). The ability to infer distance ratios from perspective depends of course on the content of the scene. If the scene consists of homogeneous, rectilinear structure, such as the urban scene in Figure 6.3, the variance of P(z0, r|p) is small and distance ratios can be estimated accurately. If the scene is devoid of such structures, the variance of the distribution is larger and ratio estimation more prone to error. As we can see from Figure 6.6, high variance in the perspective distribution can compromise the ability to estimate absolute distance from the two cues. We predict, therefore, that altering perceived size by manipulating blur will be more effective in scenes that contain rich perspective than in scenes with weak perspective.

Third, we have assumed that the other pictorial cues provide distance-ratio information only. In fact, images also contain the cue of familiar size, which conveys some absolute-distance information. We could have incorporated this into the theory by modifying the perspective distribution in Figure 6.6b; the distribution would become a 2D Gaussian with different variances horizontally and vertically. We chose not to add this feature for simplicity and because we have little idea of what the relative horizontal and vertical variances would be. It is interesting to note, however, that including familiar size might help explain anecdotal observations that the miniaturization effect is hard to obtain in some images.
Fourth, we have represented the optics of the eye with an ideal lens, free of aberrations. Image formation by real human eyes is affected by diffraction due to the pupil, at least for pupil diameters of 2–3 mm or smaller, and is also affected by a host of higher-order aberrations, including coma and spherical aberration, at larger pupil diameters (Wilson et al., 2002). Incorporating diffraction and higher-order aberrations into our calculations in Figure 6.6a would yield greater retinal-image blur than that shown for distances at or very close to the focal distance: the trough in the blur distribution would be deeper, but the rest of the distribution would be unaffected. As a consequence, the ability to estimate absolute distance would be unaffected by incorporating diffraction and higher-order aberrations as long as a sufficiently large range of distance ratios was available in the stimulus.

Fifth, we have assumed for the purposes of the experiment and modeling described below that the eye's optics are fixed when subjects view photographs. Of course, the optical power of the eye varies continually owing to accommodation (adjustments of the shape of the crystalline lens), and the muscle command sent to the muscles controlling the shape of the lens is a cue to distance, albeit a highly variable and weak one (Wallach and Norris, 1963; Fisher and Ciuffreda, 1988; Mon-Williams and Tresilian, 1999). When real scenes are viewed, accommodation turns blur into a dynamic cue that may allow the visual system to glean more distance information than we have assumed. The inclusion of accommodation in our model would have had little effect on our interpretation of the demonstration in Figure 6.3 or our psychophysical experiment because the stimuli are photographs, so the changes in the retinal image as the eye accommodated did not mimic the changes that occur with real scenes. The inclusion of accommodation would, however, definitely affect the use of blur with real scenes. We intend to pursue the use of dynamic blur and accommodation using volumetric displays that yield a reasonable approximation to the relationship for real scenes (e.g., Akeley et al., 2004; Love et al., 2009).
6.4 Predictions of the model
The model predicts that the visual system estimates absolute distance by finding the focal distance that is most consistent with the blur and perspective in a given image. If the blur and perspective are consistent with one another, accurate and precise distance estimates can be obtained. If they are inconsistent, the estimates will be generally less accurate and precise. We have examined these predictions by considering images with three types of blur: (1) blur that is completely consistent with the distance ratios in a scene, (2) blur that is mostly correlated with the distances, and (3) blur that is uncorrelated with the distances. Fourteen scenes from Google Earth were
used. Seven had a large amount of depth variation (skyscrapers) and seven had little depth variation (one- to three-story buildings). The camera was placed 500 m above the ground and oriented downwards by 35° from Earth-horizontal. The average distance from the camera to the buildings in the center of each scene was 785 m. We used a standard depth-of-field rendering approach (Haeberli and Akeley, 1990) to create blur consistent with different scales of the Google Earth locales. We captured several images of the same locale from positions on a jittered grid covering a circular aperture. We translated each image to ensure that objects in the center of the scene were aligned from one image to another and then averaged the images. The diameters of the simulated camera apertures were 60.0, 38.3, 24.5, 15.6, and 10.0 m. These unusually large apertures were needed to produce blur consistent with what a human eye with a 4.6 mm pupil would receive when focused at 0.06, 0.09, 0.15, 0.23, and 0.36 m, respectively. Figures 6.7b and c show example images with simulated 24.5 and 60 m apertures.

If the object is a plane, the distances in the image will form a gradient that runs along a line in the image. The pattern of blur is also a gradient that runs in the same direction (McCloskey and Langer, 2009). If we were to add the appropriate sign to the blur (i.e., remove the absolute-value sign in Eq. (6.2)), the blur gradient would also be linear (i.e., b would be proportional to height in the image). Larger gradients are associated with a greater slant between the object and image planes. The Google Earth scenes, particularly the ones with low depth variation, are approximately planes with the distance gradient running from bottom to top in the image. Thus, to create the second blur condition, we applied a linear blur gradient (except for the sign) running in the same direction in the image as the distance gradient. This choice was motivated in part by the fact that most of the miniaturized images available online were created by applying linear blur gradients in the postprocessing (Flickr, 2009).

To create the third blur condition, we applied a horizontal blur gradient to images in which the distance gradient was vertical. Thus, the blur gradient was orthogonal to the distance gradient and therefore the blur was not highly correlated with distance in the scene. Each row or column of pixels (for vertical or horizontal gradients, respectively) was assigned a blur magnitude based on its position along the gradient. Each row or column was then blurred independently using a cylindrical blur kernel of the appropriate diameter. Finally, all of the blurred pixels were recombined to create the final image. The maximum amounts of blur in the horizontal and vertical gradients were assigned to the average blur magnitudes along the top and bottom of the consistent-blur images. Thus, the histograms of the blur magnitude were roughly equal across the three types of blur manipulation. Figures 6.7d–g are examples of the resulting images.
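The gradient conditions are straightforward to emulate. The Python sketch below is our own simplification of the procedure just described, operating on a grayscale NumPy image: each row is blurred with a flat "cylindrical" kernel whose diameter grows linearly with the row's distance from the sharpest row. (The exact gradient used for the stimuli was matched to the consistent-blur histograms, which we ignore here.)

import numpy as np
from scipy.ndimage import convolve

def disk_kernel(diameter):
    """Cylindrical (flat disk) blur kernel of the given diameter in pixels."""
    radius = max(diameter / 2.0, 0.5)
    n = int(np.ceil(radius)) * 2 + 1
    y, x = np.mgrid[:n, :n] - n // 2
    k = (x ** 2 + y ** 2 <= radius ** 2).astype(float)
    return k / k.sum()

def vertical_blur_gradient(image, max_diameter):
    """Apply a linear vertical blur gradient to a grayscale image:
    rows at the top and bottom get the most blur, the middle row none.
    This is an approximation of the stimulus manipulation, not the
    original rendering code."""
    h = image.shape[0]
    out = np.empty_like(image, dtype=float)
    # Distance of each row from the (sharp) middle row, scaled to 0..1.
    frac = np.abs(np.arange(h) - h / 2.0) / (h / 2.0)
    diameters = np.round(frac * max_diameter).astype(int)
    # Blur the whole image once per distinct diameter, then copy rows.
    for d in np.unique(diameters):
        blurred = image if d == 0 else convolve(image.astype(float),
                                                disk_kernel(d), mode="nearest")
        out[diameters == d, :] = blurred[diameters == d, :]
    return out

Incidentally, the unusually large rendering apertures quoted above are consistent with matching angular blur: a camera 785 m from the scene needs an aperture of roughly 4.6 mm × (785/0.06) ≈ 60 m to reproduce the blur of a 4.6 mm pupil focused at 0.06 m (our reading of the numbers, not a statement from the original study).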
Figure 6.7 Four types of blur used in the analysis and experiment: (a) no blur, (b) and (c) consistent blur, (d) and (e) linear vertical blur gradient, and (f) and (g) linear horizontal blur gradient. Simulated focal distances of 0.15 m ((b), (d), and (f)) and 0.06 m ((c), (e), and (g)) are shown. In approximating the blur produced by a short focal length, the consistent-blur condition produces the most accurate blur, followed by the vertical gradient, the horizontal gradient, and the no-blur condition. (The original city images and data from Google Earth are copyright Terrametrics, SanBorn, and Google.) A color version of this figure can be found on the publisher's website (www.cambridge.org/9781107001756).
We applied the model to images with the three types of blur manipulation. We first selected pixels in image regions containing contrast by employing the Canny edge detector (Canny, 1986). The detector’s parameters were set such that it found the subjectively most salient edges. We later verified that the choice of parameters did not affect the model’s predictions. We then took the distance ratios in the scene from the video card’s z-buffer while running Google
Earth. These recovered distances constituted the depth map. The human visual system, of course, must estimate the blur from images, which adds noise to the blur measurement and, accordingly, to the depth estimate. Thus, the model presumably overestimates the precision of depth estimation from blur. For the consistent-blur condition, the depth map was used to calculate the blur applied to each pixel. For the incorrect blur conditions, the blur for each pixel was determined by the gradients applied. To model human viewers, we assumed A = 4.6 mm and s0 = 17 mm. We assumed that observers use perspective to estimate r for each pixel relative to the pixels in the best-focused region. We then used Eq. (6.2) to calculate z0 and z1 for each pixel. All of the estimates were combined to produce a marginal distribution of estimated focal distances. The median of the distribution was the final estimate of absolute distance. Figure 6.8a shows the focal-distance estimates based on the blur and distance-ratio data from the consistent-blur image in Figure 6.7c. Because the blur was rendered correctly for the distance ratios, all of the estimates indicate
the intended focal distance of 0.06 m. Therefore, the marginal distribution of estimates has a very low variance and the final estimate is accurate and precise. Figure 6.8b plots the blur/distance-ratio data from the vertical-blur image in Figure 6.7e. The focal-distance estimates now vary widely, though the majority lie close to the intended value of 0.06 m. The model predicts that vertical blur gradients should influence estimates of focal distance but in a less compelling and consistent fashion than consistent blur does. Scenes with larger depth variation (not shown here) produced marginal distributions with a higher variance. This makes sense, because the vertical blur gradient becomes a closer approximation to consistent blur as the scene becomes more planar. Figure 6.8c plots the blur/relative-distance data from the horizontal-blur image in Figure 6.7g. In horizontal-gradient images, the blur is mostly uncorrelated with the relative depth in the scene, so the focal-distance estimates are scattered. While the median of the marginal distribution is similar to the values obtained with consistent blur and the vertical gradient, the variance of the distribution is much greater. The model predicts, therefore, that a horizontal gradient will have the smallest influence of the three blur types on perceived distance.

Figure 6.8 Determining the most likely focal distance from blur and perspective. The intended focal distance was 0.06 m. Each panel plots the estimated focal distance as a function of distance ratio. The left, middle, and right panels show the estimates for consistent blur, a vertical blur gradient, and a horizontal blur gradient, respectively. The first step in the analysis was to extract the relative-distance and blur information from several points in the image. The values for each point were then used with Eq. (6.2) to estimate the focal distance. Each estimate is represented by a point. Then, all of the focal-distance estimates were accumulated to form a marginal distribution of estimates (shown on the right of each panel). The data from a consistent-blur rendering match the selected curve very closely, resulting in an extremely low variance. Though the vertical blur gradient incorrectly blurs several pixels, it is well correlated with the depths in the scene, so it too produces a marginal distribution with low variance. The blur applied by the horizontal gradient is mostly uncorrelated with depth, resulting in a marginal distribution with large variance and therefore the least reliable estimate. A color version of this figure can be found on the publisher's website (www.cambridge.org/9781107001756).
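The per-image estimation step of Section 6.4 reduces to inverting Eq. (6.2) at each selected pixel and pooling the results. A schematic Python version is shown below, with toy inputs; the Canny edge selection and the z-buffer extraction described above are omitted, and the helper name is ours.

import numpy as np

A, s0 = 4.6e-3, 17e-3        # assumed pupil diameter and eye length (m)

def estimate_focal_distance(blur, ratio):
    """Given per-pixel blur-circle diameters (m) and distance ratios
    r = z1/z0 at edge pixels, invert Eq. (6.2) for each pixel and take
    the median of the resulting focal-distance estimates."""
    blur = np.asarray(blur, dtype=float)
    ratio = np.asarray(ratio, dtype=float)
    valid = (blur > 0) & (np.abs(ratio - 1.0) > 1e-6)   # skip in-focus pixels
    z0 = (A * s0 / blur[valid]) * np.abs(1.0 - 1.0 / ratio[valid])
    return np.median(z0)

# Toy input: blur rendered for a 0.06 m focal distance at three ratios.
ratios = np.array([0.8, 1.5, 3.0])
blurs = (A * s0 / 0.06) * np.abs(1.0 - 1.0 / ratios)
print(estimate_focal_distance(blurs, ratios))   # recovers ~0.06

For consistent blur every pixel votes for the same focal distance, so the marginal distribution is tight; for the horizontal gradient the votes scatter, reproducing the pattern in Figure 6.8.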
6.5 Psychophysical experiment on estimating absolute distance from blur
We next compared human judgments of perceived distance with our model’s predictions. We used the same blur-rendering techniques to generate stimuli for the psychophysical experiment: consistent blur, vertical blur gradient, and horizontal blur gradient. An additional stimulus was created by rendering each scene with no blur. The stimuli were generated from the same 14 Google Earth scenes as those on which we conducted the analysis shown in Figure 6.8. The seven observers were unaware of the experimental hypotheses. They were positioned with a chin rest 45 cm from a CRT monitor and viewed the stimuli monocularly. Each stimulus was displayed for 3 s. Observers were told to look around the scene in each image to get an impression of its distance and scale. After each stimulus presentation, observers gave an estimate of the distance from a marked building in the center of the scene to the camera that produced the image. There were 224 unique stimuli, and each stimulus was presented seven times in random order for a total of 1568 trials. Figure 6.9 shows the results averaged across observers. The abscissa represents the simulated focal distance (the focal distance used to generate the blur in the consistent-blur condition); the values for the vertical and horizontal blur gradients are those that yielded the same maximum blur magnitudes as in
Figure 6.9 Results of the psychophysical experiment averaged across all seven subjects. The abscissa is the simulated focal distance. The ordinate is the normalized reported distance; each reported distance was divided by the observer's reported distance when the image contained no blur. The type of blur manipulation is indicated by the colors and shapes of the data points: squares for consistent blur, circles for a vertical blur gradient, and triangles for a horizontal blur gradient. The abscissa values for the vertical- and horizontal-gradient conditions were determined by matching the blur histograms with the respective consistent-blur condition. The error bars represent standard errors. A color version of this figure can be found on the publisher's website (www.cambridge.org/9781107001756).
the consistent-blur condition. The ordinate represents the average reported distance to the marked object in the center of the scene divided by the average reported distance for the no-blur control condition. Lower values mean that the scene was seen as closer, and therefore presumably smaller. All observers exhibited a statistically significant effect of blur magnitude, which indicated that the marked object appeared smaller when the blur was large. The effect of blur magnitude was much larger for the consistent-blur and vertical-blur-gradient conditions than for the horizontal-gradient condition, so there was a significant effect of blur type. The fact that the horizontal-gradient condition produced little effect means that blur per se does not affect perceived distance; rather, perceived distance is affected by blur when the pattern of blur is consistent with the relative distances specified by other cues. The results show that perceived absolute distance is influenced by the pattern and magnitude of blur just as the model predicts. Consistent blur and vertical-gradient blur yield systematic and predictable variations in perceived distance. Horizontal-gradient blur yields a much less systematic variation in perceived
distance. Thus, two cues – blur and perspective – that by themselves do not convey absolute distance information can be used in combination to make reliable estimates of absolute distance and size.
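As a small worked example of the normalization behind Figure 6.9, the snippet below divides each observer's reports by that observer's mean no-blur report and then averages across observers. The array shapes, the synthetic numbers, and the use of a single mean no-blur baseline per observer are simplifying assumptions of ours, not the exact bookkeeping used in the experiment.

```python
import numpy as np

# reports[observer, condition, focal_distance, repeat]: distance estimates (m).
# Condition 0 stands in for the no-blur control; conditions 1-3 for the three
# blur types. Synthetic numbers are used purely as placeholders.
rng = np.random.default_rng(1)
reports = rng.lognormal(mean=3.0, sigma=0.3, size=(7, 4, 8, 7))

no_blur = reports[:, 0].mean(axis=(1, 2))             # one baseline per observer
normalized = reports / no_blur[:, None, None, None]   # the ordinate of Figure 6.9

per_observer = normalized.mean(axis=3)                # average over repeats
mean_across_obs = per_observer.mean(axis=0)           # plotted points
sem_across_obs = per_observer.std(axis=0, ddof=1) / np.sqrt(per_observer.shape[0])
print(mean_across_obs.shape, sem_across_obs.shape)    # (condition, focal distance)
```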
6.6 Reconsidering blur as a depth cue
As we mentioned earlier, previous work on humans' ability to extract depth from blur has concluded that blur is at best a coarse, qualitative cue providing no more than ordinal depth information. Can the probabilistic model presented here explain why previous work revealed no quantitative effect of blur? Some investigators have looked for an effect of blur when disparity was present. Mather and Smith (2000) found no effect of blur in a distance-separation task except when blur specified a much larger separation than did disparity, and even then the effect was quite small. Watt et al. (2005) found no effect of blur in a slant-estimation task when disparity was present. To reexamine these studies, it is useful to note that disparity and blur share a fundamental underlying geometry. Disparity is the difference in image positions created by the differing vantage points of the two eyes. Defocus blur is caused by the difference in image positions created by the differing vantage points of different parts of the pupil. In other words, they both allow an estimation of depth from triangulation. We can formalize this similarity by rewriting Eq. (6.2), which relates the diameter of the blur circle in the image to the viewing situation. Replacing the aperture A with the interocular distance I, we obtain

d = I s_0 (1/z_0 − 1/r),    (6.6)

where d is the horizontal disparity. By combining the above equation with Eq. (6.2), we obtain

b/|d| = A/I,    (6.7)

which means that the blur and the horizontal disparity are affected by the same viewing parameters (e.g., viewing distance and the need for scaling), but that disparities are generally much larger than blur circles because A/I ∼ 1/12 for most viewing situations. We can represent the disparity likelihood function as P(d|z_0, r), where d is the disparity, if we incorporate the signals required for disparity scaling and correction into the mapping function (Gårding et al., 1995; Backus et al., 1999). In the format of Figure 6.6, the disparity and eye-position signals together specify both distance ratio and absolute distance.
Numerous psychophysical experiments have shown that very small depths can be discerned from disparity (Blakemore, 1970), much smaller than what can be discriminated from changes in blur (Walsh and Charman, 1988). For this reason, the variance of the depth-from-disparity distribution is generally small, so the product distribution is affected very little by the blur. The theory, therefore, predicts little if any effect of blur when a depth step is specified by disparity, and this is consistent with the experimental observations of Mather and Smith (2000) and Watt et al. (2005). This does not mean, of course, that blur is not used as a depth cue; it just means that the disparity estimate is often so low in variance that it dominates in those viewing conditions. One would have to make the disparity signal much noisier to test for an effect of blur. It is interesting in this regard to note that disparity discrimination thresholds worsen much more rapidly in front of and behind the fixation point (Blakemore, 1970) than blur discrimination thresholds do (Walsh and Charman, 1988). Because of the difference in how discrimination thresholds worsen away from the fixation point, blur might well be a more reliable depth signal than disparity for points in front of and behind where one is looking.

Other investigators have examined the use of blur in distance ordering. When the stimulus consists of two textured regions with a neutral border between them, the depth between the regions is not specified. However, when the texture in one region is blurrier than the texture in the other, the blur of the border determines the perceived distance ordering (Marshall et al., 1996; Mather, 1996; Palmer and Brooks, 2008). The border appears to belong to the region whose texture has similar blur, and that region is then seen as figure and the other region as ground. We can represent this relationship between border and region blur with a probability distribution: the distribution has zero probability for all distance ratios less than 1 and nonzero probability for all distance ratios greater than 1 (or the opposite depending on whether the inter-region border is blurred or sharp). The product of this distribution with the distribution derived from region blur (left panel, Figure 6.6) then reduces to one of the two wings of the blur function, and this specifies the distance ordering, which is consistent with the previous observations (Marshall et al., 1996; Mather, 1996; Palmer and Brooks, 2008).

Figure 6.10 demonstrates that the pattern of blur in an image can affect the interpretation of an ambiguous shape. The left and right images are photographs of the same wire-frame cube with a pencil running through it. In the left image, the camera was focused on the near vertex of the cube. In this case, the cube and pencil are readily perceived in their correct orientations. In the right image, the camera was focused on the cube's far vertex. Now the cube tends to be misperceived; there is a tendency to see the sharply focused vertex
Figure 6.10 Demonstration of the role of blur in resolving perceptual ambiguity. The left and right images are photographs of the same wire-frame cube with a pencil running through it. In the left image, the camera was focused on the cube’s near vertex. In the right image, the camera was focused on the cube’s far vertex. Notice that the right image is more difficult to interpret than the left one. These photographs were provided courtesy of Jan Souman.
as near and the blurred vertices as farther away, and when that occurs, the pencil’s apparent orientation relative to the cube is no longer sensible. Presumably, the illusion in the right image is caused by a bias towards inferring that sharply focused points are near and defocused points are far away. This demonstration shows that blur can have a significant effect on the interpretation of otherwise ambiguous stimuli. We conclude that previous claims that blur is a weak depth cue providing only coarse ordinal information are incorrect. When the depth information contained in blur is represented in the Bayesian framework, we can see that it provides useful information about metric depth when combined with information from nonmetric depth cues such as perspective. Many depth cues, such as aerial perspective (Fry et al., 1949; Troscianko et al., 1991) and shading (Koenderink and van Doorn, 1992), are regarded as nonmetric because one cannot uniquely determine depths in the scene from the cue value alone. This view stems from considering only the geometric relationship between retinal-image features associated with the cue and depth: with these cues, there is no deterministic relationship between relevant image features and depth in the scene. We propose instead that the visual system uses the information in those cues probabilistically, much as it uses blur. From this, it follows that all depth cues have the potential to affect metric depth estimates
as long as there is a nonuniform statistical relationship between the cue value and depth in the environment. Consider, for example, contrast, saturation, and brightness. O'Shea et al. (1994) showed that the relative contrast of two adjoining regions affects perceived distance ordering: the region with higher contrast is typically perceived as nearer than the low-contrast region. Egusa (1983) and Troscianko et al. (1991) showed that desaturated images appear farther away than saturated images. The retinal-image contrast and saturation associated with a given object are affected by increasing atmospheric attenuation with distance in natural scenes (Fry et al., 1949). Thus, these two perceptual effects are undoubtedly due to the viewer incorporating statistics relating contrast, saturation, and absolute distance. Brightness affects perceived distance ordering as well (Egusa, 1983): brighter regions are seen as nearer than darker ones. This effect is also understandable from natural-scene statistics because dark regions tend to be farther away from the viewer than bright regions (Potetz and Lee, 2003).

We argue that all depth cues should be conceptualized in a probabilistic framework. Such an approach has been explored in computer vision, where machine-learning techniques were used to combine statistical information about depth and surface orientation provided by a diverse set of image features (Hoiem et al., 2005; Saxena et al., 2005). Some of these features were similar to known monocular depth cues, but others were not. From the information contained in a large collection of these features, the algorithms used in those studies were able to generate reasonably accurate estimates of 3D scene layout. These results show that useful metric information is available from image features that have traditionally been considered as nonmetric depth cues. Our results and the above-mentioned computer-vision results indicate that the conventional, geometry-based taxonomy that classifies depth cues according to the type of distance information they provide is unnecessary. By capitalizing on the statistical relationship between images and the environment to which our visual systems have been exposed, the probabilistic approach used here will yield a richer understanding of how we perceive 3D layout.
References
Akeley, K., Watt, S. J., Girshick, A. R., and Banks, M. S. (2004). A stereo display prototype with multiple focal distances. ACM Trans. Graphics, 23: 804–813.
Backus, B. T., Banks, M. S., van Ee, R., and Crowell, J. A. (1999). Horizontal and vertical disparity, eye position, and stereoscopic slant perception. Vis. Res., 39: 1143–1170.
Berkeley, G. (1709). An Essay Toward a New Theory of Vision. Dublin: Pepyat.
Blakemore, C. (1970). The range and scope of binocular depth discrimination in man. J. Physiol., 211: 599–622.
Brillaut-O'Mahony, B. (1991). New method for vanishing point detection. CVGIP: Image Understand., 54: 289–300.
Burge, J. D., Fowlkes, C. C., and Banks, M. S. (2010). Natural-scene statistics predict the influence of the figure-ground cue of convexity on human depth. J. Neurosci., 30: 7269–7280.
Canny, J. F. (1986). A computational approach to edge detection. IEEE Trans. Pattern Anal. Machine Intel., 8: 679–698.
Caudek, C. and Proffitt, D. R. (1993). Depth perception in motion parallax and stereokinesis. J. Exp. Psychol.: Hum. Percept. Perf., 19: 32–47.
Cormack, L. K., Landers, D. D., and Ramakrishnan, S. (1997). Element density and the efficiency of binocular matching. J. Opt. Soc. Am. A, 14: 723–730.
Coughlan, J. M. and Yuille, A. L. (2003). Manhattan world: orientation and outlier detection by Bayesian inference. Neural Comput., 15: 1063–1088.
Egusa, H. (1983). Effects of brightness, hue, and saturation on perceived depth between adjacent regions in the visual field. Perception, 12: 167–175.
Farid, H. and Simoncelli, E. P. (1998). Range estimation by optical differentiation. J. Opt. Soc. Am. A, 15: 1777–1786.
Fielding, R. (1985). Special Effects Cinematography. Oxford: Focal Press.
Fisher, S. K. and Ciuffreda, K. J. (1988). Accommodation and apparent distance. Perception, 17: 609–621.
Flickr (2009). Tilt shift miniaturization fakes. http://www.visualphotoguide.com/tiltshift-photoshop-tutorial-how-to-make-fake-miniature-scenes/
Fry, G. A., Bridgeman, C. S., and Ellerbrock, V. J. (1949). The effects of atmospheric scattering on binocular depth perception. Am. J. Optom. Arch. Am. Acad. Optom., 26: 9–15.
Gårding, J., Porrill, J., Mayhew, J. E. W., and Frisby, J. P. (1995). Stereopsis, vertical disparity and relief transformations. Vis. Res., 35: 703–722.
Georgeson, M. A., May, K. A., Freeman, T. C. A., and Hesse, G. S. (2007). From filters to features: scale-space analysis of edge and blur coding in human vision. J. Vis., 7: 1–21.
Haeberli, P. and Akeley, K. (1990). The accumulation buffer: hardware support for high-quality rendering. ACM SIGGRAPH Comput. Graphics, 24: 309–318.
Held, R. T., Cooper, E. A., O'Brien, J. F., and Banks, M. S. (2010). Using blur to affect perceived distance and size. ACM Trans. Graphics, 29: 1–16.
Helmholtz, H. von (1867). Handbuch der Physiologischen Optik. Leipzig: Leopold Voss.
Hogervorst, M. A. and Eagle, R. A. (1998). Biases in three-dimensional structure-from-motion arise from noise in the early visual system. Proc. R. Soc. Lond. B, 265: 1587–1593.
Hogervorst, M. A. and Eagle, R. A. (2000). The role of perspective effects and accelerations in perceived three-dimensional structure-from-motion. J. Exp. Psychol.: Hum. Percept. Perf., 26: 934–955.
Hoiem, D., Efros, A. A., and Hebert, M. (2005). Geometric context from a single image. Proc. IEEE Int. Conf. Computer Vis., 1: 654–661.
Kingslake, R. (1992). Optics in Photography. Bellingham, WA: SPIE Optical Engineering Press.
Knill, D. C. (1998). Discrimination of planar surface slant from texture: human and ideal observers compared. Vis. Res., 38: 1683–1711.
Koenderink, J. J. and van Doorn, A. J. (1992). Surface shape and curvature scales. Image Vis. Comput., 10: 557–565.
Laforet, V. (2007). A really big show. New York Times, May 31.
Love, G. D., Hoffman, D. M., Hands, P. J. W., Gao, J., Kirby, A. K., and Banks, M. S. (2009). High-speed switchable lens enables the development of a volumetric stereoscopic display. Opt. Express, 17: 15716–15725.
Marshall, J., Burbeck, C., Ariely, D., Rolland, J., and Martin, K. (1996). Occlusion edge blur: a cue to relative visual depth. J. Opt. Soc. Am. A, 13: 681–688.
Mather, G. (1996). Image blur as a pictorial depth cue. Proc. R. Soc. Lond. B, 263: 169–172.
Mather, G. (1997). The use of image blur as a depth cue. Perception, 26: 1147–1158.
Mather, G. and Smith, D. R. R. (2000). Depth cue integration: stereopsis and image blur. Vis. Res., 40: 3501–3506.
Mather, G. and Smith, D. R. R. (2002). Blur discrimination and its relationship to blur-mediated depth perception. Perception, 31: 1211–1219.
McCloskey, S. and Langer, M. (2009). Planar orientation from blur gradients in a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, pp. 2318–2325.
Mon-Williams, M. and Tresilian, J. R. (1999). Ordinal depth information from accommodation? Ergonomics, 43: 391–404.
Okatani, T. and Deguchi, K. (2007). Estimating scale of a scene from a single image based on defocus blur and scene geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, MN, pp. 1–8.
O'Shea, R. P., Blackburn, S., and Ono, H. (1994). Contrast as a depth cue. Vis. Res., 34: 1595–1604.
O'Shea, R. P., Govan, D. G., and Sekuler, R. (1997). Blur and contrast as pictorial depth cues. Perception, 26: 599–612.
Palmer, S. E. (1999). Vision Science: Photons to Phenomenology. Cambridge, MA: MIT Press.
Palmer, S. E. and Brooks, J. L. (2008). Edge-region grouping in figure-ground organization and depth perception. J. Exp. Psychol.: Hum. Percept. Perf., 24: 1353–1371.
Pentland, A. P. (1987). A new sense for depth of field. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI), 9: 523–531.
Potetz, B. and Lee, T. S. (2003). Statistical correlations between 2D images and 3D structures in natural scenes. J. Opt. Soc. Am. A, 20: 1292–1303.
Saxena, A., Chung, S. H., and Ng, A. Y. (2005). Learning depth from single monocular images. In Proceedings of IEEE Advances in Neural Information Processing Systems (NIPS), Vancouver.
Spring, K. and Stiles, W. S. (1948). Variation of pupil size with change in the angle at which the light stimulus strikes the retina. Br. J. Ophthalmol., 32: 340–346.
Troscianko, T., Montagnon, R., Le Clerc, J., Malbert, E., and Chanteau, P. (1991). The role of color as a monocular depth cue. Vis. Res., 31: 1923–1930.
Wallach, H. and Norris, C. M. (1963). Accommodation as a distance cue. Am. J. Psychol., 76: 659–664.
Walsh, G. and Charman, W. N. (1988). Visual sensitivity to temporal change in focus and its relevance to the accommodation response. Vis. Res., 28: 1207–1221.
Watt, S. J., Akeley, K., Ernst, M. O., and Banks, M. S. (2005). Focus cues affect perceived depth. J. Vis., 5: 834–862.
Wilson, B. J., Decker, K. E., and Roorda, A. (2002). Monochromatic aberrations provide an odd-error cue to focus direction. J. Opt. Soc. Am. A, 19: 833–839.
7 Neuronal interactions and their role in solving the stereo correspondence problem
Jason M. Samonds and Tai Sing Lee
7.1 Introduction
Binocular vision provides important information about depth to help us navigate in a three-dimensional environment and allow us to identify and manipulate 3D objects. The relative depth of any feature with respect to the fixation point can be determined by triangulating the horizontal shift, or disparity, between the images of that feature projected onto the left and right eyes. The computation is difficult because, in any given visual scene, there are many similar features, which create ambiguity in the matching of corresponding features registered by the two eyes. This is called the stereo correspondence problem. An extreme example of such ambiguity is demonstrated by Julesz’s (1964) random-dot stereogram (RDS). In an RDS (Figure 7.1a), there are no distinct monocular patterns. Each dot in the left-eye image can be matched to several dots in the right-eye image. Yet when the images are fused between the two eyes, we readily perceive the hidden 3D structure. In this chapter, we will review neurophysiological data that suggest how the brain might solve this stereo correspondence problem. Early studies took a mostly bottom-up approach. An extensive amount of detailed neurophysiological work has resulted in the disparity energy model (Ohzawa et al., 1990; Prince et al., 2002). Since the disparity energy model is insufficient for solving the stereo correspondence problem on its own, recent neurophysiological studies have taken a more top-down approach by testing hypotheses generated by computational models that can improve on the disparity energy model (Menz and
Figure 7.1 (a) The stereo correspondence problem. When a point in space is projected on the retinas, depth can be derived from the retinal disparity between the projected points. Locally, disparity is ambiguous, as each dot in an RDS has numerous potential matches and therefore depth interpretations. The white dot on the right (black arrow) can be matched with an incorrect dot (gray arrow) or a correct dot (black arrow) on the left, yielding different interpretations. (b) A zoomed-in image of an RDS that leads to a false match in a position-shifted, disparity-tuned neuron. The left image has been shifted to the left with respect to the right image. The left receptive field (dashed ovals) is shifted to the right with respect to the right receptive field. The dots line up to optimally excite this neuron, suggesting that the local disparity in the image matches the preferred disparity of this neuron. This is because noncorresponding dots have lined up with the left and right receptive fields, so this neuron has detected a false match.
Freeman, 2003; Samonds et al., 2009a; Tanabe and Cumming, 2009). While these models are quite distinct in detail, they share the common theme that organized neuronal interactions among disparity-tuned neurons play a significant role. This chapter will review these hypotheses and findings pertinent to them.
7.2 The disparity energy model
The disparity energy model was described in detail in Chapter 2 (by Qian and Li), so we will review only some basic details of this model that are essential to understanding this chapter. The key observation is that binocular-disparity-tuned neurons in the primary visual cortex have left- and right-eye receptive fields that can be modeled as two-dimensional Gabor filters in their respective retinotopic coordinates. These receptive fields differ with respect to each other in their mean retinotopic position or phase, causing the neuron to respond more strongly to binocular disparities equal to that difference: the preferred disparity. The response to stimuli with different binocular disparities results in a tuning curve that has the shape of a Gabor function with a peak response at this preferred disparity. The reason the disparity energy model is insufficient for solving the stereo correspondence problem is that the receptive fields respond selectively to monocular features in binocular stimuli (Chen and Qian, 2004; Read and Cumming, 2007). For any given stimulus, the features in the left- and right-eye fields can cause the neuron to respond strongly to an incorrect disparity (Figure 7.1b). Or, to put it another way, the neuron responds strongly to an incorrect, or false, match: similar features in the left- and right-eye fields that do not correspond to the same feature in space. Neurophysiologists characterize the disparity tuning of a neuron by presenting a series of random-dot patterns for each disparity and computing the average response to these random-dot patterns. This way, the selective responses to monocular features are averaged out to reveal the binocular-disparity tuning. Our brain does not have this luxury; it must compute the disparity for a single stimulus.
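A minimal one-dimensional sketch of a position-shift disparity-energy unit may help readers who want to see the model in code. The Gabor parameters, the stimulus, and the sign conventions are illustrative choices of ours rather than the formulation used in the studies discussed here.

```python
import numpy as np

def gabor(x, sigma=0.15, freq=4.0, phase=0.0):
    """1-D Gabor receptive-field profile (x in degrees of visual angle)."""
    return np.exp(-x**2 / (2 * sigma**2)) * np.cos(2 * np.pi * freq * x + phase)

def disparity_energy(left_img, right_img, x, preferred_disparity, center=0.0):
    """Binocular energy of one complex cell in the position-shift model.

    The left and right receptive fields are identical Gabors whose centers are
    offset by the preferred disparity; a quadrature pair of binocular simple
    cells is squared and summed to give the complex-cell energy.
    """
    d = preferred_disparity
    energy = 0.0
    for phase in (0.0, np.pi / 2):                          # quadrature pair
        rf_left = gabor(x - (center + d / 2), phase=phase)
        rf_right = gabor(x - (center - d / 2), phase=phase)
        simple = rf_left @ left_img + rf_right @ right_img  # binocular simple cell
        energy += simple ** 2
    return energy

# Demo: a 1-D random-dot pattern presented with a true disparity of +0.2 deg.
rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 401)              # sample positions (deg)
dx = x[1] - x[0]
scene = rng.standard_normal(x.size)
true_disp = 0.2
half_shift = int(round(true_disp / 2 / dx))
left = np.roll(scene, +half_shift)           # left-eye image: scene shifted right
right = np.roll(scene, -half_shift)          # right-eye image: scene shifted left

# With a matched preferred disparity the two monocular inputs add coherently,
# so the energy is usually (though not always: false matches!) largest near +0.2 deg.
for d in (-0.4, -0.2, 0.0, 0.2, 0.4):
    print(f"preferred disparity {d:+.1f} deg -> energy {disparity_energy(left, right, x, d):10.1f}")
```

Running the unit on a single pattern, rather than averaging over many patterns, is exactly the situation in which false peaks can appear, which is the problem taken up in the next section.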
7.3 How to avoid false matches
In the simplest terms, the strategy for solving the stereo correspondence problem is to eliminate false matches and reinforce correct matches. The most straightforward method to eliminate false matches is to pool or average the responses of neurons with similar disparity tunings that have different tunings for other feature cues such as orientation (Qian and Zhu, 1997). Disparity-tuned neurons that are tuned for different orientations will respond differently to the
same stimuli and therefore tend to produce different false matches. Averaging their responses can filter out the different false matches. This procedure can then be repeated across different spatial scales or frequencies and across space as well (Qian and Zhu, 1997; Chen and Qian, 2004). Although this method does reduce the number of false matches, the performance is still poor, with inaccurate and imprecise location of edges, and regions with completely wrong disparity estimates (Figure 7.2). Instead of averaging across spatial frequencies, disparity can be computed in a coarse-to-fine sequence (Marr and Poggio, 1979). This way, low spatial frequencies can be used to maximize the range of disparity estimates, and high spatial frequencies can be used to maximize the precision of disparity estimates. Chen and Qian (2004) applied this strategy to a neuronal-network model using neurons represented by the disparity-energy model. Neurons with a disparity tuning based on position differences in their left- and right-eye receptive fields
Figure 7.2 (a) Schematic illustration of orientation and spatial-frequency pooling of the receptive fields of neurons with the same preferred disparity. (b) Right image, and corresponding disparity map estimated by pooling of disparity-energy-model neurons (from Chen and Qian, 2004).
are ideal for providing a coarse estimate of disparity, while neurons with a disparity tuning based on phase differences are ideal for providing a finer estimate of disparity. Combining this coarse-to-fine procedure with orientation and spatial pooling does provide a substantial improvement of disparity estimates over simple pooling (Figure 7.3, top row, versus Figure 7.2), but the performance
Figure 7.3 (a) Schematic illustration of orientation pooling and coarse-to-fine integration of neurons with the same preferred disparity. (b) Right image, and corresponding disparity map estimated by pooling and coarse-to-fine integration of disparity-energy-model neurons (top row, from Chen and Qian, 2004). Representative result from a modern computer vision algorithm (bottom row, from Tappen and Freeman, 2003).
still falls short of what modern computer vision algorithms are able to achieve (Figure 7.3, top versus bottom row). The errors are especially prevalent in low-contrast regions, so Chen and Qian proposed that spatial interactions might be necessary to propagate disparity information from high-contrast regions to low-contrast regions. Marr and Poggio also suggested that there were conditions where a coarse-to-fine algorithm would need spatial interactions, such as those proposed in their 1976 algorithm, to facilitate estimating disparity. There is some recent neurophysiological evidence supporting this coarse-to-fine strategy. Among the disparity-tuned neurons in the cat primary visual cortex, the strongest functional connections, as determined by cross-correlating spike trains, are from neurons with a broad tuning to neurons with a narrow tuning (Menz and Freeman, 2003). The broadly tuned neurons also respond earlier to visual stimulation than do narrowly tuned neurons. These two pieces of evidence suggest that disparity computations proceed in a coarse-to-fine manner in the primary visual cortex. However, Menz and Freeman observed a greater number of functional connections from narrowly tuned neurons to broadly tuned neurons, although these were weaker. Functional connections were also observed between neurons with very similar disparity tunings, as well as between neurons with very different disparity tunings (Menz and Freeman, 2004). These additional observations suggest that the neuronal interactions involved in stereo processing are more complicated than a simple coarse-to-fine scheme. Read and Cumming (2007) made a clever observation by asking why we need both position-shift and phase-shift binocular-disparity neurons in the visual cortex, as either one alone is theoretically sufficient for providing binocular-disparity information. They proposed that position-shift disparity-tuned neurons are better suited for determining disparity in natural scenes than phase-shift disparity-tuned neurons are. That is, if the receptive fields of neurons in the primary visual cortex effectively detect features, then the most natural mechanism for determining the depth of a feature would be to shift the receptive fields between the two eyes in a manner corresponding to the binocular disparity. Each receptive field therefore detects the same feature. Phase-based disparity-tuned neurons, on the other hand, do not correspond to what we naturally experience in binocular viewing. Each receptive field can detect different features. Since false matches do not have to follow the predictions of binocular disparity during natural viewing, these phase-based disparity-tuned neurons are more likely to respond to false matches and can act as false-match detectors, or lie detectors. Thus, the activation of these phase-shift disparity-tuned neurons can serve to suppress the activity of the position-shift disparity-tuned neurons that respond to false matches. A simple application of these lie detectors
Figure 7.4 (a) Schematic illustration of suppression between competing position-shift and phase-shift disparity-tuned neurons. (b) Right image, and corresponding disparity map estimated with a “lie detector” algorithm (top row, from Read and Cumming, 2007). Representative result from a modern computer vision algorithm (bottom row, from Tappen and Freeman, 2003).
does reduce the number of false matches, but the performance is still considerably poorer than modern computer vision algorithms (Figure 7.4, top versus bottom row). Tanabe and Cumming (2009) provided some neurophysiological evidence supporting the lie detector hypothesis. They fitted linear–nonlinear Poisson models and computed spike-triggered covariance matrices from the responses of neurons in the primary visual cortex of macaques to binocular white noise stimulation. The principal components of each matrix corresponded to linear filters with excitatory and inhibitory elements in their model. They found several neurons that had responses explained by models composed of
a competing pair of excitatory and inhibitory elements. The disparity that caused maximum excitation in the excitatory element caused the maximum inhibition in the inhibitory element. This would be consistent with a lie detector neuron providing an inhibitory input to the recorded neuron. Indeed, Tanabe and Cumming found that this inhibitory element reduced the magnitude of false disparity peaks when they simulated responses to static binocular patterns in their model.
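To close this section with something concrete, here is a toy one-dimensional illustration of the pooling idea: responses of matching units that share a preferred disparity but differ in receptive-field size are averaged before a disparity is chosen. Normalized window correlation is used as a crude stand-in for disparity-energy units, and the stimulus and window parameters are our own choices; the point is only that pooling across channels reduces, without eliminating, false matches.

```python
import numpy as np

def match_profile(left, right, center, window, max_disp):
    """Normalized correlation between a left-image window and right-image
    windows at each candidate disparity: a crude stand-in for a bank of
    disparity-selective units with one receptive-field size."""
    w = np.arange(-window, window + 1)
    patch_l = left[center + w]
    scores = []
    for d in range(-max_disp, max_disp + 1):
        patch_r = right[center + w - d]
        scores.append(patch_l @ patch_r /
                      (np.linalg.norm(patch_l) * np.linalg.norm(patch_r) + 1e-12))
    return np.array(scores)

rng = np.random.default_rng(3)
scene = rng.standard_normal(400)
true_disp = 4
left = np.roll(scene, true_disp) + rng.standard_normal(400)   # independent noise
right = scene + rng.standard_normal(400)                      # in each eye

max_disp = 12
disparities = np.arange(-max_disp, max_disp + 1)
centers = np.arange(60, 340)

def chosen_disparity(center, window_sizes):
    # Pool (average) the profiles across window sizes, then take the peak.
    profile = np.mean([match_profile(left, right, center, w, max_disp)
                       for w in window_sizes], axis=0)
    return disparities[np.argmax(profile)]

single = np.array([chosen_disparity(c, (4,)) for c in centers])
pooled = np.array([chosen_disparity(c, (4, 8, 16, 32)) for c in centers])
print(f"single small window : {(single == true_disp).mean():.2f} of locations correct")
print(f"pooled across sizes : {(pooled == true_disp).mean():.2f} of locations correct")
```

Pooling typically raises the fraction of correct locations, but some errors remain, which is why the spatial interactions discussed in the next section are of interest.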
7.4 Why do computer vision algorithms perform better?
The improvements to the disparity energy model discussed above still fall noticeably short in their performance in comparison with modern computer vision solutions to the stereo correspondence problem (Figure 7.5). The primary reason why modern computer vision algorithms perform better is that they incorporate a number of computational constraints that are helpful for resolving ambiguity in stereo matching. The first and foremost constraint is what Marr and Poggio (1976) termed the continuity constraint, which encapsulates the assumption that surfaces of objects tend to be smooth. A second common constraint is what Marr and Poggio called the uniqueness constraint, which allows only one depth estimate at each location. Figure 7.6 illustrates a neuronal adaptation of Marr and Poggio’s schematic description of their stereo computation algorithm. The gray circles represent neurons with near, zero, and far (from upper left to lower right) disparity tuning, and the gray triangles represent excitatory synaptic connections. The black circles and triangles represent inhibitory interneurons and inhibitory synaptic connections, respectively. As you move across space (lower left to upper right), neurons with the
Figure 7.5 Right image (left), ground truth disparity map (center), and a disparity map (right) obtained using belief propagation estimation of a Markov random field (from Tappen and Freeman, 2003). This particular algorithm is not the state-of-the-art solution for this image or for the Middlebury database, but was chosen as a representative example to demonstrate how effective several graphical-model techniques are in solving the stereo correspondence problem.
Figure 7.6 Neuronal-network adaptation of the schematic representation of Marr and Poggio’s (1976) cooperative stereo computation algorithm.
same disparity tuning excite each other. The rationale is that surfaces tend to be smooth and continuous, so nearby neurons should come up with similar disparity estimates and therefore reinforce each other, enforcing the continuity constraint. Each feature in an image is unique and cannot be at two depths at the same time, so at each point in space, neurons with different disparity tunings inhibit each other to enforce the uniqueness constraint. This ensures that there is only a single disparity estimate for each point in space. Belhumeur and Mumford (1992) have formulated the stereo correspondence problem as a Bayesian inference problem; indeed, depth perception in general can be framed this way, a notion that can be traced back to Helmholtz's unconscious inference. Using Bayes' theorem, the likelihood of a certain observation given a set of depth hypotheses is multiplied by the prior distribution of the hypotheses to yield the posterior distribution of the hypotheses. The disparity value where the posterior distribution peaks gives us the maximum a posteriori probability (MAP) estimate of the disparity at each location. A statistical prior is a probability distribution that encodes our assumptions about how a variable varies across space and how it varies with other variables; in this case it is presumably learned from the statistics of natural scenes. The Markov random field is a graphical model that can be used to represent the dependency
relationships between the variables across space and scales, and between the variables and observations. In our case, it is a spatial lattice of disparity random variables, with their local dependencies and their likelihood specified in terms of probability distributions. In the framework of statistical inference, Marr and Poggio's (1976) computational constraints can be considered as statistical priors of some kind. Since the continuity constraint captures spatial relationships, it can be referred to as a spatial prior. In modern statistical inference algorithms applied to the stereo correspondence problem, spatial priors provide the information necessary to infer disparity in regions of the image that have high uncertainty. Spatial priors can be learned by studying the statistics of scenes in three-dimensional environments or can be derived from our general or specific knowledge about relationships in the three-dimensional environments where stereo correspondence must be solved. The utility of such priors is that when you have a reliable disparity estimate for one location or for some spatial configuration (e.g., the boundary of an object), you can use the most likely estimate of the disparity for a location that does not have reliable disparity information, based on the learned or defined relationships between locations. For example, you might assign the same disparity within a boundary as the disparity estimated at that boundary. Uncertainty for a particular location could arise because of a lack of features, or because of low contrast and noisy features. Uncertainty can also arise even with reliable features when there are multiple potential matches (e.g., Figure 7.1) (Julesz and Chang, 1976). Modern statistical inference algorithms, such as the belief propagation and graph cut algorithms, have been shown to be very effective for propagating binocular-disparity information across space for interpolation and stereo-matching disambiguation, with reasonably good solutions (Scharstein and Szeliski, 2002; Tappen and Freeman, 2003; Klaus et al., 2006; Wang and Zheng, 2008; Yang et al., 2009). Using more accurate models of spatial priors can improve depth inference performance (Belhumeur, 1996; Scharstein and Pal, 2007; Woodford et al., 2009). These more effective spatial priors allow a stereo algorithm to clean up noisy estimates from spurious correspondences between the two images. They are especially valuable when stereo cues become weak or unreliable, such as in image regions of low contrast or with texture, and for objects far away from the observer (more than 20 feet), where disparity values are very small (Cutting and Vishton, 1995). In natural scenes, reliable information can be sparse, and many regions are ambiguous. For example, a blank wall has binocular-disparity information only along the edges and yet we perceive depth along the entire wall. This phenomenon is called surface interpolation. Spatial priors such as the continuity
constraint facilitate surface interpolation. Psychophysical studies provide many additional examples of how we exploit spatial dependencies when inferring disparity in regions with absent or ambiguous disparity information (Julesz and Chang, 1976; Ramachandran and Cavanaugh, 1985; Collett, 1985; Mitchison and McKee, 1985; Westheimer, 1986; Stevenson et al., 1991).
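As a concrete, minimal version of MAP inference with a spatial prior, the sketch below estimates disparity along a single scanline by minimizing a data term plus a pairwise smoothness term (a continuity prior) with the min-sum dynamic-programming recursion, which is exact on a chain and equivalent to belief propagation there. The costs, weights, and one-dimensional restriction are illustrative simplifications of ours; the 2-D algorithms cited above additionally handle occlusions and propagate information across scanlines.

```python
import numpy as np

def scanline_map_disparity(left, right, max_disp, window=2, smooth=0.4):
    """Exact MAP disparity labeling of one scanline under the energy
    sum_x data(x, d_x) + smooth * sum_x |d_x - d_{x+1}|  (a chain MRF)."""
    n = len(left)
    disps = np.arange(max_disp + 1)
    # Data (unary) term: sum of absolute differences over a small window.
    data = np.zeros((n, max_disp + 1))
    kernel = np.ones(2 * window + 1)
    for d in disps:
        diff = np.abs(left - np.roll(right, d))       # right image shifted by d
        data[:, d] = np.convolve(diff, kernel, mode="same")
    # Forward pass of the min-sum (Viterbi) recursion.
    cost = data.copy()
    back = np.zeros((n, max_disp + 1), dtype=int)
    pair = smooth * np.abs(disps[:, None] - disps[None, :])   # |d_prev - d_cur|
    for xpos in range(1, n):
        total = cost[xpos - 1][:, None] + pair
        back[xpos] = np.argmin(total, axis=0)
        cost[xpos] = data[xpos] + np.min(total, axis=0)
    # Backtrack the optimal labeling.
    est = np.zeros(n, dtype=int)
    est[-1] = int(np.argmin(cost[-1]))
    for xpos in range(n - 2, -1, -1):
        est[xpos] = back[xpos + 1, est[xpos + 1]]
    return est

# Demo: a random-dot scanline containing a step edge in depth.
rng = np.random.default_rng(4)
right = rng.standard_normal(240)
true = np.where(np.arange(240) < 120, 2, 6)
left = np.array([right[xpos - true[xpos]] for xpos in range(240)])

est = scanline_map_disparity(left, right, max_disp=8)
print("fraction of pixels assigned the true disparity:",
      round(float(np.mean(est == true)), 2))
```

The smoothness term plays the role of the continuity constraint, pulling ambiguous or noisy locations toward the disparities of their reliable neighbors, while the per-location choice of a single best label plays the role of the uniqueness constraint.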
7.5 Neurophysiological evidence for spatial interactions
We have examined whether or not there was evidence in the macaque primary visual cortex to support the strategy of using spatial dependencies to infer local disparity (Samonds et al., 2009a). We recorded from small groups of neurons that had overlapping and distinct receptive fields, as well as similar and distinct disparity tunings. Spike trains of pairs of neurons were cross-correlated to determine whether or not there was a temporal correlation between their spike trains and, if so, the strength of that correlation. This is suggestive of whether or not the neurons are connected in some manner and, if so, how strongly. First, we show an example pair of neurons with neighboring receptive fields (Figure 7.7a) and very similar disparity tunings (Figure 7.7b). This pair is
Figure 7.7 (a) An example pair of neurons in the primary visual cortex of a macaque monkey recorded while dynamic random-dot stereograms were presented (from Samonds et al., 2009a). (b) These neurons have very similar disparity tunings. (c) There is a significant peak in the cross-correlation between their spike trains.
representative of two neurons running along the lower-left-to-upper-right axis in Figure 7.6. The Marr and Poggio model predicts that these two neurons will have an excitatory connection. Indeed, Figure 7.7c illustrates that these two neurons have a significant positive correlation between their spike trains. A broad positive peak in a cross-correlation histogram can be produced by a pair of neurons connected by multiple excitatory synapses or having a polysynaptic excitatory connection (Moore et al., 1970). In Figure 7.8a, we plot the distance between the receptive fields versus the disparity-tuning similarity for neuronal pairs that had a significant correlation between their spike trains. As you move up the y axis, where the receptive fields become farther apart and no longer overlap, only neurons with similar disparity tuning are connected: pairs with positive correlation (rdisp) between their tuning curves. This means that neurons encoding different regions of the scene are likely to be connected only if they are tuned to similar disparities. Again, this is similar to what is shown in Figure 7.6 along the lower-left-to-upper-right axis, except the constraint based on the neurophysiological data is not as stringent. In Marr and Poggio's model, facilitation occurs across space only when the disparities are the same. The connections between neurons are also stronger when the disparity tunings of the neurons are more similar (Figure 7.8b). This is also consistent with the
Figure 7.8 Connectivity in the primary visual cortex (n = 63 neuron pairs) depends on disparity-tuning similarity (from Samonds et al., 2009a). (a) Distance between receptive fields versus disparity-tuning similarity for neurons that had significant connectivity (spike correlation). (b) Strength of connectivity (spike correlation) versus disparity-tuning similarity (area under the half-height of the peak in the cross-correlation histogram).
continuity constraint because the connections provide stronger reinforcement when the neurons are tuned to the same disparity. If these results are indeed indicative of the representation of spatial priors in the primary visual cortex, we are limited in interpreting specific details of the potential priors because we tested the neurons only with frontoparallel surfaces defined by dynamic random-dot stereograms. However, the prior does appear to be a more relaxed version of Marr and Poggio's continuity constraint, which would be expected for a more sophisticated spatial prior that would allow a system to perform well in natural environments. Another characteristic feature that we observe from our cross-correlation measurements is that the connection strengths are dynamic. The strength of the connection depends on which disparity was presented to the monkey, and it varies over time. In Figure 7.9, we show how the spike correlation at the mutually preferred disparity (the value that gives the maximum product of the peak-normalized tuning curves) has a delayed increase relative to the other disparities presented. This observation is also consistent with the presence of some complex polysynaptic circuit that causes the interaction between neuronal pairs to have a substantial delay. The response to the preferred disparity is also enhanced over time, which supports the suggestion that these observed interactions result in mutual reinforcement among the population of neurons. Representative examples are shown in Figures 7.10a and b, which demonstrate that, over time, the secondary-peak responses to nonpreferred disparities are relatively suppressed compared with the primary-peak response at the preferred disparity. These examples also reveal that the primary peak becomes sharper over time. The data for the
Figure 7.9 The dynamics of connectivity in the primary visual cortex (n = 63 neuron pairs) depends on disparity-tuning similarity (from Samonds et al., 2009a). (a) Population average of the product of normalized firing rates. (b) Population average of the spike correlation.
Figure 7.10 Dynamics of disparity tuning in primary visual cortex (from Samonds et al., 2009a). (a), (b) Example disparity-tuning curves over time fitted with Gabor functions. (c) Example disparity-tuning curves showing deviations from a Gabor function.
examples in Figures 7.10a and b were fitted with a Gabor function, but when you inspect the data more closely, the relative enhancement of the response to the preferred disparity causes the tuning curve to deviate from this function over time (Figure 7.10c). The peak around the preferred disparity becomes narrower, while the valleys around nonpreferred disparities become broader. Lastly, we also have preliminary evidence suggesting that the connections between similarly tuned neurons are indeed facilitatory and enhance the responses of neurons that are presented with ambiguous disparity information (Samonds et al., 2009b). During recording, we presented macaques with stimuli similar to what Julesz and Chang (1976) presented humans with in a psychophysical study. Within the receptive field of a recorded neuron, a random-dot stereogram was presented that could be perceived as near or far because it had multiple potential consistent matches (ambiguous disparity; Figure 7.11c). Outside the receptive field, a small percentage of random dots were introduced that had only one consistent match (unambiguous disparity). When dots with near disparity were presented outside the receptive field of a near-tuned neuron (Figure 7.11a, black arrow), the response was enhanced (Figure 7.11d, black). When dots with far disparity were presented outside the receptive field of this same near-tuned neuron, the response was suppressed (Figure 7.11d, gray). This suggests that unobserved near-tuned neurons were excited by the unambiguous dots and enhanced the response of our recorded neuron driven by the ambiguous stimuli via the connections revealed in Figure 7.8. There was a substantial delay in this enhancement compared with the disparity-dependent response within the classical receptive field (Figure 7.11b). This long-latency response followed a delayed increase in the correlation between the spike trains (Figure 7.9b), supporting the idea that this effect is mediated by propagation of the disparity signals through disparity-tuning-specific lateral facilitation in
Figure 7.11 (a) An example neuron recorded in the primary visual cortex of a macaque monkey tuned for near disparities (Samonds et al., 2009b). (b) The post-stimulus time histogram for this neuron when a near (black) and far (gray) disparity are presented. (c) The neuron was presented with a dynamic random-dot stereogram that had a region covering the receptive field that had consistent near and far binocular matches. (d) The post-stimulus time histogram for the neuron when unambiguous random dots with near (black) and far (gray) disparity are presented outside the classical receptive field.
the neuronal network. This is consistent with psychophysical data that show that the introduction of a small percentage of sparse unambiguous dots can strongly bias the subject's 3D perception of ambiguous stimuli (Julesz and Chang, 1976).
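For readers unfamiliar with the analysis used throughout this section, here is a bare-bones sketch of a spike-train cross-correlogram. The binning, the covariance-style normalization, and the simulated spike trains are illustrative choices of ours, not the exact procedure of Samonds et al. (2009a), which also involves corrections (for example, for stimulus-locked firing) that are omitted here.

```python
import numpy as np

def cross_correlogram(spikes_a, spikes_b, duration_ms=1000.0,
                      bin_ms=1.0, max_lag_ms=100.0):
    """Correlation between the binned spike counts of two neurons as a
    function of the lag of neuron B relative to neuron A (spike times in ms)."""
    edges = np.arange(0.0, duration_ms + bin_ms, bin_ms)
    a = np.histogram(spikes_a, edges)[0].astype(float)
    b = np.histogram(spikes_b, edges)[0].astype(float)
    a -= a.mean()
    b -= b.mean()
    max_shift = int(max_lag_ms / bin_ms)
    denom = np.sqrt(np.sum(a**2) * np.sum(b**2)) + 1e-12
    lags = np.arange(-max_shift, max_shift + 1)
    cc = []
    for s in lags:                           # correlate a[t] with b[t + s]
        a_seg = a[max(0, -s): len(a) - max(0, s)]
        b_seg = b[max(0, s): len(b) - max(0, -s)]
        cc.append(np.sum(a_seg * b_seg) / denom)
    return lags * bin_ms, np.array(cc)

# Simulated pair: some of neuron A's spikes are followed a few ms later by a
# spike in neuron B (a crude stand-in for a weak excitatory interaction).
rng = np.random.default_rng(5)
shared = rng.uniform(0, 1000, 40)
spikes_a = np.concatenate([shared, rng.uniform(0, 1000, 60)])
spikes_b = np.concatenate([shared + rng.normal(4.0, 2.0, shared.size),
                           rng.uniform(0, 1000, 60)])

lags, cc = cross_correlogram(spikes_a, spikes_b)
print(f"correlogram peak: height {cc.max():.3f} at {lags[np.argmax(cc)]:+.0f} ms")
```

A peak displaced from zero lag, as in this simulation, is the kind of feature summarized in the figures above by measures such as peak height or the area under the half-height of the peak.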
7.6 Relationship with visual processing of contours and 2D patterns
The strategy described above for solving the stereo correspondence problem has analogous mechanisms for detecting contours and segmenting surfaces (Ren et al., 2005; Latecki et al., 2008; Maire et al., 2008). Graphical models, such as Markov random fields and conditional random fields, that capture spatial dependencies among variables have led to better algorithms for detecting contours, junctions, and surfaces. Psychophysical research has also suggested
that we detect contours using our prior knowledge about the statistical relationships between oriented segments across space (Field et al., 1993; Polat and Sagi, 1994). These statistical relationships form what is termed the association field. The structure of the association field has been shown to be consistent with the co-occurrence statistics of oriented edges across space in natural scenes (Elder and Goldberg, 2002; Geisler et al., 2001; Sigman et al., 2001), suggesting that it indeed represents spatial priors that we exploit for optimal contour inference. The neural evidence that we have observed suggests that there might exist an association field in V1 in the binocular-disparity domain for surface interpolation. Anatomical studies of the primary visual cortex in several species reveal an organization that could support this inference mechanism (Bosking et al., 1997; Gilbert and Wiesel, 1989; Lund et al., 2003; Malach et al., 1993; Stettler et al., 2002). Neurophysiological recordings are also consistent with the anatomy, suggesting that organized connections between orientation-tuned neurons could facilitate contour detection by synchronizing and enhancing responses when the stimuli match the relationships predicted by the spatial priors (Gilbert et al., 1996; Kapadia et al., 1995, 2000; Samonds et al., 2006). Figure 7.12a shows an example of a neuron pair in the primary visual cortex recorded while drifting sinusoidal gratings were presented to a cat (Samonds et al., 2006). The neurons have neighboring receptive fields and similar orientation tunings (Figure 7.12b). Similarly to the disparity-tuned neurons shown in Figure 7.7, these neurons have a positive correlation between their spike trains (Figure 7.12c), suggesting that there is an analogous facilitation between neurons with similar orientation tunings and similar disparity tunings. Figure 7.12d also supports this analogy by confirming that a stronger spike correlation is measured for neurons with more similar orientation tunings (cf. Figure 7.8b). However, there are systematic changes in the organized connectivity for two-dimensional visual processing when drifting concentric rings are presented to pairs of neurons (Figure 7.13a). The example neurons shown here have neighboring receptive fields, but very different orientation tunings (Figure 7.13b). In this case, even though the neurons have very different orientation tunings, they still have a significant positive correlation between their spike trains. This was true from the population perspective too. Neuronal pairs that have receptive fields aligned with the concentric rings have a strong spike correlation (Figure 7.13d). This flexible facilitation for neurons with different preferred orientations is exactly what is predicted by the psychophysical data and natural-scene statistics. Local contour segments are enhanced when they are co-aligned with other segments, whether the contour is straight or curved. One interesting future experiment will be to test whether or not a similar shift in spike correlation
Figure 7.12 (a) An example pair of neurons recorded in the primary visual cortex of a cat while drifting sinusoidal gratings were presented (from Samonds et al., 2006). (b) These neurons have very similar orientation tunings. (c) There is a significant peak in the cross-correlation between their spike trains. (d) Strength of connectivity (spike correlation) versus orientation-tuning similarity (peak height in the cross-correlation histogram).
strengths occurs if neurons tuned for disparity are presented with slanted or curved surfaces, since the data shown in Figure 7.8b were based only on the presentation of frontoparallel surfaces. Ben-Shahar et al. (2003) have proposed that the spatial interactions in the primary visual cortex could be part of a general inference framework that easily
Figure 7.13 (a) An example pair of neurons recorded in the primary visual cortex of a cat while drifting concentric rings were presented (from Samonds et al., 2006). (b) These neurons have very different orientation tunings. (c) There is a significant peak in the cross-correlation between their spike trains. (d) Strength of connectivity (spike correlation) versus orientation-tuning similarity (peak height in the cross-correlation histogram).
extends from contour detection to analyze texture, shading, and stereo correspondence. This generalization makes sense, considering that we typically perform these tasks in concert. The results of the inference of each of the cues could facilitate the inference of the other cues. We would want this facilitation to exploit the statistical relationships between the cues in a similar
manner to how the spatial priors can be exploited. For example, we have been investigating the interaction between shading and binocular-disparity cues for inferring depth in the primary visual cortex. Statistical analysis of three-dimensional scenes has revealed that near regions tend to be brighter than far regions (Potetz and Lee, 2003). It was found that populations of neurons in the primary visual cortex of macaques are likely to encode this statistical relationship. Neurons tuned for near disparities tended to prefer brighter surfaces than neurons tuned for far disparities (Potetz et al., 2006). This relationship and more sophisticated relationships between shading and depth can be exploited using graphical models to provide very powerful estimations of three-dimensional shape from shading information alone (Potetz and Lee, 2006).
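The kind of luminance-distance statistic referred to here can be computed directly from co-registered luminance and range images. The sketch below uses synthetic stand-ins that build the near-is-brighter relationship in by construction, simply to show the form of the computation; applied to real co-registered natural-scene data, the same correlation gives the kind of relationship reported by Potetz and Lee (2003).

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic stand-ins for a co-registered luminance image and range map.
distance = rng.gamma(shape=2.0, scale=3.0, size=(128, 128))      # metres
luminance = 1.0 / (1.0 + 0.2 * distance) + 0.05 * rng.standard_normal((128, 128))

# Correlation between local brightness and distance across the "scene".
r = np.corrcoef(luminance.ravel(), distance.ravel())[0, 1]
print(f"luminance-distance correlation: {r:+.2f} (negative: nearer tends to be brighter)")
```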
7.7 Conclusions
We have described several different approaches to solving the stereo correspondence problem, and the neurophysiological data that support each of these approaches. Throughout this chapter, we have made a case for taking advantage of spatial priors to infer disparity because the strategy is so effective, but this method and the other methods that we have discussed are not mutually exclusive. Most of the present neurophysiological data are too general to draw strong conclusions about exactly what mechanism or mechanisms the brain might be using. For example, we have found evidence of fast suppression between neurons with very different disparity tunings but with overlapping receptive fields, which might reflect the uniqueness constraint proposed by Marr and Poggio (Samonds et al., 2009a). Such competitive interactions could also be consistent with the lie detector hypothesis of Read and Cumming (2007), and with the results reported by Tanabe and Cumming (2009), which could be considered as an alternative way of implementing the uniqueness constraint. Further and more precise experiments are needed to distinguish between these two hypotheses. Another general result, observed in many studies, is that the disparity tuning sharpens over time (Figure 7.10) (Menz and Freeman, 2003; Cumming and Thomas, 2007; Samonds et al., 2009a). The disparity tuning clearly deviates from a Gabor function with narrow peaks and broad valleys (Samonds et al., 2007, 2008, 2009a). This result only suggests that the primary visual cortex refines local disparity estimates over time, which is true for all of the proposed solutions to the stereo correspondence problem that we have described. Although much more research is necessary to elucidate the specific mechanisms of how the brain solves the stereo correspondence problem, the aggregate results of these neurophysiological studies of the primary visual cortex all lead
to the conclusion that organized neuronal interactions play a significant role in refining the local disparity estimates provided by the disparity-energy model. This suggests that future neurophysiological research on stereopsis might benefit from multielectrode recordings and testing for contextual interactions in binocular receptive fields. Experiments should continue to be guided by the computational principles that have led to better solutions to the stereo correspondence problem. Additional neurophysiological data can, in turn, provide insight to develop more effective algorithms.

Acknowledgments

We appreciate the technical assistance and helpful discussions provided by Karen McCracken, Ryan Poplin, Matt Smith, Nicholas Hatsopoulos, Brian Potetz, and Christopher Tyler. This work was supported by NEI F32 EY017770, NSF CISE IIS 0713203, AFOSR FA9550-09-1-0678, and a grant from the Pennsylvania Department of Health through the Commonwealth Universal Research Enhancement Program.

References

Belhumeur, P. (1996). A Bayesian approach to binocular stereopsis. Int. J. Comput. Vis., 19: 237–260.
Belhumeur, P. and Mumford, D. (1992). A Bayesian treatment of the stereo correspondence problem using half-occluded regions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Champaign, IL, pp. 506–512.
Ben-Shahar, O., Huggins, P. S., Izo, T., and Zucker, S. W. (2003). Cortical connections and early visual function: intra- and inter-columnar processing. J. Physiol. (Paris), 97: 191–208.
Bosking, W. H., Zhang, Y., Schofield, B., and Fitzpatrick, D. (1997). Orientation selectivity and the arrangement of horizontal connections in tree shrew striate cortex. J. Neurosci., 17: 2112–2127.
Chen, Y. and Qian, N. (2004). A coarse-to-fine disparity energy model with both phase-shift and position-shift receptive field mechanisms. Neural Comput., 16: 1545–1577.
Collett, T. S. (1985). Extrapolating and interpolating surfaces in depth. Proc. R. Soc. Lond. B, 224: 43–56.
Cumming, B. G. and Thomas, O. M. (2007). Dynamics of responses to anticorrelated disparities in primate V1 suggest recurrent network interactions. Soc. Neurosci. Abstr., 716.1.
Cutting, J. E. and Vishton, P. M. (1995). Perceiving layout and knowing distances: the integration, relative potency, and contextual use of different information about depth. In W. Epstein and S. Rogers (eds.), Perception of Space and Motion, pp. 69–117. Vol. 5 of Handbook of Perception and Cognition. San Diego, CA: Academic Press.
Elder, J. H. and Goldberg, R. M. (2002). Ecological statistics of Gestalt laws for the perceptual organization of contours. J. Vis., 2: 324–353.
Field, D. J., Hayes, A., and Hess, R. F. (1993). Contour integration by the human visual system: evidence for a local “association field.” Vis. Res., 33: 173–193.
Geisler, W. S., Perry, J. S., Super, B. J., and Gallogly, D. P. (2001). Edge co-occurrence in natural images predicts contour grouping performance. Vis. Res., 41: 711–724.
Gilbert, C. D. and Wiesel, T. N. (1989). Columnar specificity of intrinsic horizontal and corticocortical connections in cat visual cortex. J. Neurosci., 9: 2432–2442.
Gilbert, C. D., Das, A., Ito, M., Kapadia, M., and Westheimer, G. (1996). Spatial integration and cortical dynamics. Proc. Natl. Acad. Sci. USA, 93: 615–622.
Julesz, B. (1964). Binocular depth perception without familiarity cues. Science, 145: 356–362.
Julesz, B. and Chang, J.-J. (1976). Interaction between pools of binocular disparity detectors tuned to different disparities. Biol. Cybern., 22: 107–119.
Kapadia, M. K., Ito, M., Gilbert, C. D., and Westheimer, G. (1995). Improvement in visual sensitivity by changes in local context: parallel studies in human observers and in V1 of alert monkeys. Neuron, 15: 843–856.
Kapadia, M. K., Westheimer, G., and Gilbert, C. D. (2000). Spatial distribution of contextual interactions in primary visual cortex and in visual perception. J. Neurophysiol., 84: 2048–2062.
Klaus, A., Sormann, M., and Karner, K. (2006). Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure. In Proceedings of the International Conference on Pattern Recognition (ICPR), Hong Kong, Vol. 3, pp. 15–18.
Latecki, L. J., Lu, C., Sobel, M., and Bai, X. (2008). Multiscale random fields with application to contour grouping. In Proceedings of IEEE Advances in Neural Information Processing Systems (NIPS), Vancouver, pp. 913–920.
Lund, J. S., Angelucci, A., and Bressloff, P. C. (2003). Anatomical substrates for functional columns in macaque monkey primary visual cortex. Cereb. Cortex, 13(1): 15–24.
Maire, M., Arbelaez, P., Fowlkes, C., and Malik, J. (2008). Using contours to detect and localize junctions in natural images. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, AK.
Malach, R., Amir, Y., Harel, M., and Grinvald, A. (1993). Relationship between intrinsic connections and functional architecture revealed by optical imaging and in vivo targeted biocytin injections in primate striate cortex. Proc. Natl. Acad. Sci. USA, 90: 10469–10473.
Marr, D. and Poggio, T. (1976). Cooperative computation of stereo disparity. Science, 194: 283–287.
Marr, D. and Poggio, T. (1979). A computational theory of human stereo vision. Proc. R. Soc. Lond. B, 204: 301–328.
Menz, M. D. and Freeman, R. D. (2003). Stereoscopic depth processing in the visual cortex: a coarse-to-fine mechanism. Nature Neurosci., 6: 59–65.
Menz, M. D. and Freeman, R. D. (2004). Functional connectivity of disparity-tuned neurons in the visual cortex. J. Neurophysiol., 91: 1794–1807.
Mitchison, G. J. and McKee, S. P. (1985). Interpolation in stereoscopic matching. Nature, 315: 402–404.
Moore, G. P., Segundo, J. P., Perkel, D. H., and Levitan, H. (1970). Statistical signs of synaptic interactions in neurons. Biophys. J., 10: 876–900.
Ohzawa, I., DeAngelis, G. C., and Freeman, R. D. (1990). Stereoscopic depth discrimination in the visual cortex: neurons ideally suited as disparity detectors. Science, 249: 1037–1041.
Polat, U. and Sagi, D. (1994). The architecture of perceptual spatial interactions. Vis. Res., 34: 73–78.
Potetz, B. R. and Lee, T. S. (2003). Statistical correlations between 2D images and 3D structures in natural scenes. J. Opt. Soc. Am., 20: 1292–1303.
Potetz, B. R. and Lee, T. S. (2006). Scaling laws in natural scenes and the inference of 3D shape. In Proceedings of IEEE Advances in Neural Information Processing Systems (NIPS), Vancouver, pp. 1089–1096.
Potetz, B. R., Samonds, J. M., and Lee, T. S. (2006). Disparity and luminance preference are correlated in macaque V1, matching natural scene statistics. Soc. Neurosci. Abstr., 407.2.
Prince, S. J. D., Cumming, B. G., and Parker, A. J. (2002). Range and mechanism of encoding of horizontal disparity in macaque V1. J. Neurophysiol., 87: 209–221.
Qian, N. and Zhu, Y. (1997). Physiological computation of binocular disparity. Vis. Res., 37: 1811–1827.
Ramachandran, V. S. and Cavanaugh, P. (1985). Subjective contours capture stereopsis. Nature, 317: 527–530.
Read, J. C. A. and Cumming, B. G. (2007). Sensors for impossible stimuli may solve the stereo correspondence problem. Nature Neurosci., 10: 1322–1328.
Ren, X., Fowlkes, C., and Malik, J. (2005). Scale-invariant contour completion using conditional random fields. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Vol. 2, Beijing, pp. 1214–1221.
Samonds, J. M., Zhou, Z., Bernard, M. R., and Bonds, A. B. (2006). Synchronous activity in cat visual cortex encodes collinear and cocircular contours. J. Neurophysiol., 95: 2602–2616.
Samonds, J. M., Potetz, B. R., and Lee, T. S. (2007). Implications of neuronal interactions on disparity tuning in V1. Soc. Neurosci. Abstr., 716.4.
Samonds, J. M., Potetz, B. R., Poplin, R. E., and Lee, T. S. (2008). Neuronal interactions reduce local feature uncertainty. Soc. Neurosci. Abstr., 568.14.
Samonds, J. M., Potetz, B. R., and Lee, T. S. (2009a). Cooperative and competitive interactions facilitate stereo computations in macaque primary visual cortex. J. Neurosci., 29: 15780–15795.
Samonds, J. M., Poplin, R. E., and Lee, T. S. (2009b). Binocular disparity in the surround biases V1 responses to ambiguous binocular stimuli within the classical receptive field. Soc. Neurosci. Abstr., 166.21.
Scharstein, D. and Pal, C. (2007). Learning conditional random fields for stereo. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Minneapolis, MN.
Scharstein, D. and Szeliski, R. (2002). A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis., 47: 7–42.
Sigman, M., Cecchi, G. A., Gilbert, C. D., and Magnasco, M. O. (2001). On a common circle: natural scenes and Gestalt rules. Proc. Natl. Acad. Sci. USA, 98: 1935–1940.
Stettler, D. D., Das, A., Bennett, J., and Gilbert, C. D. (2002). Lateral connectivity and contextual interactions in macaque primary visual cortex. Neuron, 36: 739–750.
Stevenson, S. B., Cormack, L. K., and Schor, C. (1991). Depth attraction and repulsion in random dot stereograms. Vis. Res., 31: 805–813.
Tanabe, S. and Cumming, B. G. (2009). Push–pull organization of binocular receptive fields in monkey V1 helps to eliminate false matches. Soc. Neurosci. Abstr., 852.11.
Tappen, M. F. and Freeman, W. T. (2003). Comparison of graph cuts with belief propagation for stereo, using identical MRF parameters. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Nice, France, pp. 900–907.
Wang, Z. and Zheng, Z. (2008). A region based stereo matching algorithm using cooperative optimization. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, AK.
Westheimer, G. (1986). Spatial interaction in the domain of disparity signals in human stereoscopic vision. J. Physiol., 370: 619–629.
Woodford, O., Fitzgibbon, A., Torr, P., and Reid, I. (2009). Global stereo reconstruction under second order smoothness priors. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL.
Yang, Q., Wang, L., Yang, R., Stewénius, H., and Nistér, D. (2009). Stereo matching with color-weighted correlation, hierarchical belief propagation and occlusion handling. IEEE Trans. Pattern Anal. Mach. Intell., 31: 492–504.
PART II MOTION AND NAVIGATION IN 3D
8
Stereoscopic motion in depth
Robert S. Allison and Ian P. Howard
8.1
Introduction
In 1997, we were designing experiments to assess the stability of correspondence between points in the two retinas and the phenomenon of stereoscopic hysteresis (Diner and Fender, 1987; Fender and Julesz, 1967). As part of these experiments, we presented binocularly uncorrelated random-dot images to the two eyes in a stereoscope. Binocularly uncorrelated images produce a percept of noisy, incoherent depth, since there is no consistent disparity signal. However, when we moved the images in the two eyes laterally in opposing directions, we obtained a compelling sense of coherent motion in depth. When the display was stopped, the stimulus again appeared as noisy depth. We quickly realized that the motion-in-depth percept was consistent with dichoptic motion cues in the stimulus. Thus, a compelling sense of changing depth can be supported by a stimulus that produces no coherent static depth. This was quite surprising, since experiments several years earlier had suggested that stereoscopic motion-in-depth perception could be fully explained by changes in disparity between correlated images. Unknown to us, Shioiri and colleagues had made similar findings, which they reported at the same Association for Research in Vision and Ophthalmology (ARVO) meeting where we first presented our findings (Shioiri et al., 1998, 2000), although we found out they had also presented them earlier at a meeting in Japan. We performed a number of experiments on this phenomenon, reported as conference abstracts (Allison et al., 1998; Howard et al., 1998) that were subsequently cited. However, we have never properly
documented these studies, since Shioiri et al. had priority. This chapter reviews these studies in terms of their original context and in the light of subsequent research.
8.2
Visual cues to motion in depth
The principal monocular cues to motion in depth are changing accommodation, image looming, and motion parallax between the moving object and stationary objects. Binocular cues take three forms:

• Changing absolute binocular disparity of the images of the moving object. If the eyes fixate the moving object, the changing absolute disparity is replaced by changing vergence. If the eyes are stationary, the disparity produced by the moving object increases as a function of the tangent of the angle of binocular subtense of the moving object.
• Changing internal binocular disparity. The disparity between the different parts of the approaching object increases in accord with the inverse quadratic relation between relative disparity and object distance.
• Changing relative binocular disparity between the images of the moving object and the images of stationary objects. The relative disparity increases in proportion to the distance between the objects.

We are concerned with motion in depth simulated by changing the relative disparity between the images of two random-dot displays presented at a fixed distance in a stereoscope. The observer fixates a stationary marker, so there is also a changing relative disparity between the fixation point and the display. There are three ways in which the visual system could, in theory, code changes in relative disparity. They are illustrated schematically in Figure 8.1 (Cumming and Parker, 1994; Portfors-Yeomans and Regan, 1996; Regan, 1993). First, the change in binocular disparity over time could be registered. We will refer to this as the change-of-disparity (CD) signal. Second, the opposite motion of the images could be registered. We will refer to this as the interocular velocity difference (IOVD) signal. These two coding mechanisms differ in the order in which information is processed. For the CD signal, the disparity at each instant is registered first, and then the temporal derivative of the disparity codes motion in depth. In the IOVD signal, the motion of each image is registered first, and then the interocular difference in motion codes motion in depth. The IOVD signal can be thought of as dichoptic motion parallax and may access the same mechanism as does monocular motion parallax. The third possibility is that motion in depth could
be coded by specialized detectors sensitive to changing disparity in the absence of instantaneous disparity signals.
How can we dissociate these binocular signals? Cumming and Parker designed a display for this purpose. In a central region subtending 1.22°, each dot moved horizontally for 67 ms and was then replaced by a new dot. The cycle was repeated in a temporally staggered fashion, with dots in the two eyes moving in opposite directions. The dots were renewed in such a way as to keep the mean disparity within the central region constant. The surrounding random dots were stationary and unchanging. The short lifetime of the dots was beyond the temporal resolution of the system that detects changing disparity but not beyond the resolution of the monocular motion-detection system. Thus, it was claimed that the display contained a detectable IOVD signal but not a detectable CD signal. In a second display of oppositely moving dots, the spatial frequency of the depth modulation of a random-dot display was beyond the spatial resolution of the stereoscopic system but not that of the motion system. Motion in depth was not seen in either display. However, the unchanging disparity may have suppressed impressions of motion in depth (Allison and Howard, 2000). Furthermore, although monocular motion was visible in both displays, the motion thresholds were considerably elevated, perhaps to a point above the threshold for detection of an IOVD in the dichoptic images. One cannot assume that detectable monocular motions necessarily produce a detectable IOVD signal.

Figure 8.1 Motion-in-depth mechanisms. The “change of disparity” operates on the disparity signal (i.e., in the cyclopean domain) to signal a changing disparity. The “difference-of-velocity” detector detects a pure interocular velocity difference between monocular motion detectors. The “dynamic disparity detector” is directly sensitive to changing disparity in binocularly matched features.
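To make the relationship between the CD and IOVD signals concrete, the short sketch below (our own illustration, not part of the original studies; the interocular separation and approach speed are assumed values) computes the two monocular image directions of a point approaching along the midline and shows that, for a binocularly matched point, differentiating the disparity (the CD route) and differencing the two monocular image velocities (the IOVD route) yield identical motion-in-depth signals; the two routes differ only in the order of the operations.

    import numpy as np

    # Geometry: eyes at x = -I/2 and x = +I/2; a point approaches along the
    # midline from 2.0 m to 0.5 m at constant speed. Angles are in radians.
    I = 0.065                      # interocular separation (m); assumed, typical value
    t = np.linspace(0.0, 3.0, 301)
    Z = 2.0 - 0.5 * t              # distance of the point from the eyes (m)

    # Monocular image directions (azimuth of the point seen by each eye).
    az_left  = np.arctan2( I / 2, Z)   # point lies to the right of the left eye
    az_right = np.arctan2(-I / 2, Z)   # and to the left of the right eye

    # Binocular disparity of the point (angle of binocular subtense).
    disparity = az_left - az_right

    # Change-of-disparity (CD) route: register disparity, then differentiate.
    cd = np.gradient(disparity, t)

    # Interocular-velocity-difference (IOVD) route: differentiate each eye's
    # image motion first, then take the interocular difference.
    iovd = np.gradient(az_left, t) - np.gradient(az_right, t)

    # For a matched point the two signals are identical.
    print(np.allclose(cd, iovd))   # True

The distinction between the mechanisms therefore only has observable consequences when a display degrades one of the intermediate quantities – the instantaneous disparity or the coherent monocular motion – which is what the displays discussed below are designed to do.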
The CD signal can be isolated by using a dynamic random-dot stereogram (DRDS) to remove the motion signals, and hence the IOVD, but leave the instantaneous disparities. In a DRDS, the dot patterns in both eyes are changed in each video frame, so that coherent monocular image motion is not present. At any instant, the dot patterns in the two eyes are the same, so that a CD signal or a dynamic-disparity signal can be extracted. Julesz (1971) created such a DRDS, successfully depicting a square oscillating in depth. Pong et al. (1990) noted that the impression of motion in depth in a DRDS was similar to that in a stereogram with persisting dot patterns (a conventional random-dot stereogram, RDS). Thus the CD signal alone is sufficient to generate motion in depth. Furthermore, Regan (1993) and Cumming and Parker (1994) found that the threshold disparity and temporal-frequency dependence for detection of motion in depth in a DRDS were similar to those for a conventional RDS. They concluded that the IOVD signal provided no additional benefit. Arguing from parsimony, Cumming and Parker proposed that the only effective binocular cue to motion in depth is that of CD based on differentiation of the same disparity signal used to detect static disparity, with no need for an IOVD mechanism. Before accepting this proposal, we need a measure of motion in depth created by only the IOVD signal.
In all the above experiments, the same random-dot display was presented to the two eyes – the patterns in the two eyes were spatially correlated. Motion in depth can be created by changing the binocular disparity between motion-defined shapes in a random-dot stereogram in which the dot patterns in the two eyes are uncorrelated (Halpern, 1991; Lee, 1970; Rogers, 1987). This effect demonstrates that spatial correlation of fine texture is not required for disparity-defined motion in depth. However, the motion-defined forms in these displays were visible in each eye’s image even when the forms no longer changed in disparity. In other words, the signal generating motion in depth could have been the changing instantaneous disparity between these motion boundaries rather than a pure difference-of-motion signal.
As we show here, motion in depth can be produced by spatially uncorrelated but temporally correlated displays (Allison et al., 1998; Howard et al., 1998; found independently by Shioiri et al., 1998). Our basic stimulus was one in which distinct random-dot displays were presented to each eye and moved coherently in opposite directions. Chance matches between dots in stationary images produce an impression of lacy depth. When the images move in opposite directions, the mean disparity of randomly matched dots at any instant remains constant. There is therefore no change in instantaneous mean disparity and therefore no CD signal. The stimulus leaves the IOVD signal intact but there is also a dynamic disparity signal, since all sets of randomly paired dots undergo a change in disparity of the same velocity and sign.
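The logic of this stimulus can be illustrated with a toy simulation (ours, with arbitrary parameter values): when two independently generated dot rows are displaced in opposite directions, the signed disparities of chance pairings average out to roughly zero on every frame, so there is no coherent instantaneous disparity to differentiate, yet every pairing that persists from frame to frame changes its disparity by the same amount and in the same direction.

    import numpy as np

    rng = np.random.default_rng(1)

    # Independently generated (spatially uncorrelated) left- and right-eye dot
    # positions along one horizontal line, in arcmin. Values are arbitrary
    # choices for illustration.
    n_dots = 200
    left  = rng.uniform(0, 1000, n_dots)
    right = rng.uniform(0, 1000, n_dots)

    speed = 30.0          # differential (interocular) velocity, arcmin/s
    dt = 1.0 / 67.0       # frame duration, s
    n_frames = 20

    def chance_disparities(left, right):
        """Signed disparity of each left-eye dot and its nearest right-eye dot."""
        nearest = right[np.argmin(np.abs(left[:, None] - right[None, :]), axis=1)]
        return left - nearest

    mean_disp = []
    for frame in range(n_frames):
        # Opposite lateral motion: each half-image moves at half the
        # differential speed.
        shift = 0.5 * speed * dt * frame
        d = chance_disparities(left + shift, right - shift)
        mean_disp.append(d.mean())

    # The mean signed disparity of the chance matches hovers near zero on
    # every frame, so there is no coherent CD signal from the mean ...
    print(np.round(mean_disp, 2))

    # ... but any pairing that is maintained across frames changes its
    # disparity by exactly the differential displacement per frame -- a
    # consistent dynamic-disparity (and IOVD) signal.
    d0 = left - right
    d1 = (left + 0.5 * speed * dt) - (right - 0.5 * speed * dt)
    print(np.allclose(d1 - d0, speed * dt))   # True for every persisting pair

Whether the visual system exploits this consistency by differencing monocular velocities or by tracking the changing disparity of persisting matches is exactly the question addressed in the experiments that follow.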
8.3
Motion in depth from spatially uncorrelated images: effects of velocity and temporal frequency
The experiment reported here demonstrates that motion in depth can be produced by spatially uncorrelated but temporally correlated displays and that effective motion in depth can be produced either by the CD signal or by the IOVD signal. In the first experiment, we looked at the effects of dot speed and temporal frequency on the percept of motion in depth in spatially uncorrelated moving displays. 8.3.1
Methods
The left and right eyes’ images were superimposed in a Wheatstone stereoscope and presented in isolation in an otherwise dark room. A pair of Tektronix 608 oscilloscope displays were located to the left and right of the subject and viewed through mirrors set at ±45◦ so that they formed a fused dichoptic image that appeared directly in front of the observer. The displays were computer-generated from a Macintosh computer video card at a resolution of 640 × 480 × 24 at 67 Hz (Quadra 950, Apple Computer Inc., Cupertino, CA). Monochrome left and right images were drawn into separate bit planes of the color video card (the green and red channels). This allowed perfect synchronization of the timing of the two video signals. Custom raster sweep generator hardware processed the left and right video signals to drive the horizontal and vertical raster positions in concert with the luminance modulation of the left and right displays and draw the images. Images were aligned with graticules (1 cm grid spacing with 2 mm markings) overlaid on the display screen and with each other by reference to an identical grid located straight ahead of the observer at a fixation distance of 33 cm (seen through the semisilvered stereoscope mirrors). Spatial calibration resulted in a pixel resolution of 1.7 and 1.3 arcmin in the horizontal and vertical dimensions, respectively. The pixel size had no relation to disparity, as the disparity signals were provided separately as digitally modulated analog signals. In this experiment, a modulated triangle-wave signal was added to the horizontal raster sweep, causing the displayed images to oscillate to and fro at constant velocity (between alternations). Digital modulation controlled the peak–peak amplitude of the disparity oscillation (or, equivalently, the peak–peak differential displacement for uncorrelated stimuli) in steps of 3.5 arcsec over a range of ±2◦ . This allowed precise disparity signals without resorting to subpixel positioning techniques. Switching electronics allowed different disparity oscillations or stationary stimuli in different parts of the display.
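For readers who want to reproduce the displacement profile, the following fragment (a reconstruction under assumed conventions, not the original display code) generates a triangle-wave differential-displacement signal of a given peak-to-peak amplitude and frequency, splits it equally and oppositely between the two half-images, and confirms that the differential velocity is constant between direction reversals.

    import numpy as np

    def triangle_wave(t, peak_to_peak, freq):
        """Triangle wave centred on zero with the given peak-to-peak
        amplitude (arcmin) and frequency (Hz)."""
        phase = (t * freq) % 1.0
        tri = 4.0 * np.abs(phase - 0.5) - 1.0        # ranges from -1 to +1
        return 0.5 * peak_to_peak * tri

    # Example parameters: 30 arcmin peak-to-peak at 0.5 Hz, sampled at the
    # 67 Hz frame rate.
    amp, freq = 30.0, 0.5
    t = np.arange(0.0, 4.0, 1.0 / 67.0)
    delta = triangle_wave(t, amp, freq)              # differential displacement

    # Each eye's image is displaced by half the differential signal, in
    # opposite directions.
    left_offset = +0.5 * delta
    right_offset = -0.5 * delta

    # Between reversals the differential velocity is constant
    # (2 x amplitude x frequency under the conventions assumed here).
    speed = np.abs(np.diff(delta)) * 67.0
    print(np.round(np.median(speed), 1), "arcmin/s")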
Figure 8.2 Basic stimulus arrangement. The observer fixated the binocularly visible dot in the center of the stimulus and monitored fixation via adjacent nonius lines (the left eye sees one line and the right eye sees the other). In one half of the stimulus (top in this example) was the test display, moving in opposite directions in the two eyes at a given frequency and velocity. For experiments using matching tasks, the oscillation of the correlated comparison image (bottom in this example) was adjusted by the observer to match the motion in depth of the test stimulus.
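A minimal sketch of how such a dichoptic frame sequence could be assembled (our reconstruction with assumed parameter values, not the code that drove the oscilloscopes): the test half-images are generated independently for the two eyes, the comparison half-images are copies of a single pattern, and both pairs are displaced in counterphase along a triangle-wave profile.

    import numpy as np

    rng = np.random.default_rng(0)

    def random_dots(n, width, height):
        """Random (x, y) dot positions within one panel, in arcmin."""
        return rng.uniform([0.0, 0.0], [width, height], size=(n, 2))

    def counterphase_frames(left_pattern, right_pattern, delta):
        """Shift the two half-images horizontally in opposite directions by
        +/- delta/2 on each frame and return the list of dichoptic frames."""
        frames = []
        for d in delta:
            frames.append((left_pattern + [+d / 2.0, 0.0],
                           right_pattern + [-d / 2.0, 0.0]))
        return frames

    # Panel size roughly matching the displays described here (about
    # 17.5 x 1.75 deg, expressed in arcmin); the dot count is arbitrary.
    width, height, n_dots = 17.5 * 60, 1.75 * 60, 250

    # Triangle-wave differential displacement, 30 arcmin peak-to-peak at
    # 0.5 Hz, sampled at the 67 Hz frame rate.
    t = np.arange(0.0, 4.0, 1.0 / 67.0)
    phase = (t * 0.5) % 1.0
    delta = 15.0 * (4.0 * np.abs(phase - 0.5) - 1.0)

    # Test display: independently generated (spatially uncorrelated) half-images.
    test = counterphase_frames(random_dots(n_dots, width, height),
                               random_dots(n_dots, width, height), delta)

    # Comparison display: one pattern shown to both eyes (spatially correlated).
    shared = random_dots(n_dots, width, height)
    comparison = counterphase_frames(shared, shared.copy(), delta)

    print(len(test), "frames of", n_dots, "dots per half-image")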
In the first experiment, we measured the perceived velocity of the motion in depth of a spatially uncorrelated test display with respect to the motion in depth of a spatially and temporally correlated display. The two displays were presented one above and the other below a central fixation point, with the test and comparison positions periodically interchanged to counterbalance (Figure 8.2). The basic test and comparison patterns subtended 17◦ in width by 1.75◦ in height and consisted of bright dots randomly distributed at a density of 1.5% on a black background. Fixation helped to control vergence, which was monitored by nonius lines. In the test display, the patterns of dots in the two eyes were independently generated and spatially uncorrelated. Correlated test images were also presented as controls. The half-images of the test display were moved coherently to and fro in opposite directions (in counterphase) to generate an IOVD signal in the test display or an IOVD and a CD signal in the comparison display. In any given trial, the test images moved from side to side in counterphase with a triangular displacement profile with a peak–peak relative displacement amplitude of 3.75, 7.5, 15, 30, or 60 arcmin at a frequency of 0.2, 0.5, or 1 Hz, which produced alternating
segments of constant differential velocity between 0.75 and 120 arcmin/s. The boundaries of each image were stationary (defined by the display bezel), so that there were no moving deletion–accretion boundaries. Since all the dots in each eye’s image moved coherently, there were no motion-defined boundaries. The comparison display was the same except that the dot patterns in the two eyes were identical (spatially correlated). In each trial, the images of the comparison display moved at the same frequency as those of the test display and were initially set at a random disparity oscillation amplitude. The subject then used key presses to adjust the velocity of the images of the comparison display until the display appeared to move in depth at the same velocity as the test display. The motion in depth of the two displays was in phase in one set of trials and in counterphase in another set of trials. This was done to ensure that the motion in depth of the test display was not due to motion contrast or depth contrast. It also ensured that the motion in depth was not due to vergence tracking. We used a velocity-matching procedure because the amplitude of motion in depth of the uncorrelated display was undefined, although, theoretically, the velocity signal could be integrated to produce an impression of depth amplitude.
8.3.2
Results and discussion
When the correlated images were stopped or presented statically, the display appeared displaced in depth with respect to the fixation point by an amount related to the instantaneous disparity. The stopped uncorrelated display produced an impression of lacy depth because of chance pairings between dots. Thus, at any instant, the disparity between the correlated images was the same for all dot pairs and related to the relative image displacement, while the mean signed disparity between the uncorrelated images was always zero. When the dots in a spatially uncorrelated display move, however, all randomly paired images undergo a consistent change in disparity. This leaves the IOVD signal intact. For both types of display, motion in depth was not visible when the rest of the visual field was blank. A fixation point was sufficient to trigger the percept of motion in depth. Fixation on a point ensures that impressions of motion in depth are not generated by vergence movements of the eyes.
The mean results for 10 subjects are shown in Figure 8.3. It can be seen that the velocity of motion in depth created by the IOVD signal alone is an approximately linear function of the velocity of motion created by both a CD signal and an IOVD signal. This is true for all three frequencies of motion. Regression analysis confirmed a significant main effect of velocity, but no significant main effect of frequency or interaction between velocity and frequency, and a tight relationship between the test and matched velocities (R² = 0.912). The slope of the regression function indicates that the velocity of the uncorrelated images was about 10% higher than that of the correlated images when the two displays appeared to move in depth at the same velocity. In other words, the absence of the change-in-disparity signal from the uncorrelated display did not have much effect on the efficiency of the motion-in-depth signal. We have already mentioned that Regan, and Cumming and Parker found little loss in sensitivity to motion in depth in a correlated dynamic random-dot display which lacked the IOVD signal. Similarly, we found little loss in apparent depth when the CD signal was absent. It thus seems that good motion in depth is produced either by the CD signal or by the IOVD signal.

Figure 8.3 Matching between spatially uncorrelated and correlated motion-in-depth displays as a function of the frequency and velocity of the test display (N = 10). The data show the average velocity (± the standard error of the mean, s.e.m.) of a spatially correlated comparison display that was set to match the apparent velocity of motion in depth of the spatially uncorrelated, temporally correlated test display. The open diamonds show matches using a spatially correlated but temporally uncorrelated comparison stimulus (DRDS).

8.4
Effects of density
In this experiment, we investigated whether the motion in depth of an uncorrelated display varies with the dot density. The range of lacy depth experienced in static uncorrelated random-dot stereograms depends on the density, which could be a factor in motion in depth elicited by moving uncorrelated RDSs.
Figure 8.4 Effects of stimulus density (N = 5). The uncorrelated test display moved at a velocity of 30 arcmin/s and a frequency of 0.5 Hz.
8.4.1
Methods
The stimuli and methods were the same as in the previous experiment, with the following exceptions. In all trials, the uncorrelated test display moved at a velocity of 30 arcmin/s and a frequency of 0.5 Hz. The subject adjusted the velocity of the image motion in the comparison correlated display until the velocities of motion in depth of the two displays appeared the same. The dot density of both displays was the same and varied between 0.35 and 50%. 8.4.2
Results and discussion
Figure 8.4 shows that dot density had no significant effect on the perceived velocity of motion in depth of the uncorrelated display relative to the correlated display. As in the previous experiment, the matched velocity of the comparison display was slightly less than that of the test display.
8.5
Stimulus features
We explored the effects of varying a number of features of the stimulus, including dot size, continuous motion, motion direction, and correlation. In a control experiment, the test display was the same but the comparison display was a dynamic random-dot stereogram in which the dots were spatially correlated but temporally uncorrelated (with a dot lifetime of one frame). As in the main experiments, the subject adjusted the velocity of the images in the comparison stimulus until the two displays appeared to move in depth
at the same velocity. We thus compared the efficiency of the pure IOVD signal with that of the pure CD signal. In the main experiment, it was unlikely but conceivable that the subjects could have matched the velocity of monocular motion of the images rather than the velocity of perceived motion in depth. They could not adopt this strategy in this control condition, because the comparison stimulus contained no coherent monocular motion. Results for the control condition with DRDS comparison stimuli are also shown in Figure 8.3. There was little difference between matches made with RDS and DRDS comparison displays, suggesting that subjects were matching apparent motion in depth.
When the triangle-wave disparity modulation was replaced with a sawtooth modulation, the IOVD specified a constant velocity in one direction with periodic abrupt resets. With correlated displays, this gave the expected impression of a stimulus that ramped from near to far (or vice versa for ramps of increasing uncrossed disparity) with periodic resets of position in the opposite direction. With uncorrelated displays, there was an impression of continuous motion in the direction specified by the IOVD punctuated by a disturbance as the waveform reset itself.
When we presented a static uncorrelated test display with a single moving dot in place of the comparison display, the dot appeared to move in depth for all four observers. Similarly, if the dot was stationary and an IOVD was imposed on the uncorrelated display, relative motion in depth was perceived. Thus, a textured comparison display was not required, and the subject perceived relative motion in depth based on the relative IOVD between the stimuli.
Anticorrelated stimuli do not normally give rise to reliable impressions of stereoscopic depth (Cumming et al., 1998), although thin anticorrelated features can produce an impression of depth (Helmholtz, 1909). Depth from anticorrelated thin features has been attributed to matching of opposite edges of the features in the two eyes, which have the same contrast polarity (e.g., Kaufman and Pitblado, 1969). Cumming et al. (1998) did not find a large effect of dot size on depth from anticorrelated RDSs, and claimed that the effects of image scaling were due more to element spacing than to size. We studied the response to IOVD in 50% density dot stimuli with large (45 arcmin) or small (2.6 arcmin) dots in four observers. If the dots in the test display were anticorrelated (opposite signs in the two eyes) rather than uncorrelated, motion in depth was still perceived from IOVD for both dot sizes. This observation needs careful follow-up, but suggests that matching of luminance patches is not required for the perception of motion in depth from IOVD and also suggests that instantaneous matches of same-polarity edges may be made.
Vertical IOVD signals are not related to motion in depth in real-world stimuli. We swapped the horizontal and vertical inputs to the displays, effectively turning the displays on their sides. This produced vertical disparities and interocular velocity differences. None of nine observers perceived any motion in depth in these conditions. Rather, they reported rivalry or up–down motion. This suggests that motion parallax or binocular-rivalry effects were not responsible for the apparent motion in depth in the main experiment.
8.6
Lifetime
If the motion in depth in our displays arises from the IOVD signal, then the percept of motion in depth should degrade as the motion signal degrades. We degraded the motion signal by shortening the lifetime of individual dots. We measured the minimum dot lifetime required to produce motion in depth with a spatially uncorrelated random-dot display, and the effects of reduced dot lifetime on suprathreshold motion-in-depth percepts. 8.6.1
Methods
The methods were similar to those described for the main experiment with the following exceptions. In the present experiment, a fraction of the dots disappeared in each frame and were replaced by randomly positioned dots. The replacement rate was controlled so that dots survived for a variable number of frames. For suprathreshold measurements, motion-in-depth velocity was matched with a temporally and spatially correlated random-dot stereogram, as described earlier. The test display had an interocular velocity difference of either 30 or 15 arcmin/s at 0.5 Hz. For testing the discrimination of the direction of motion in depth, the test display had a sawtooth IOVD profile. The comparison display was a stationary random-dot display. The subject reported whether the test display appeared to approach or to recede relative to the stationary display. We used a method of constant stimuli, with the ranges tailored to each subject based on pilot testing. 8.6.2
Results and discussion
As the dot lifetime was reduced, the matched velocity decreased for spatially uncorrelated displays but not for spatially correlated ones. Figure 8.5 shows that, for two representative subjects, perceived depth declined sharply at shorter dot lifetimes for spatially uncorrelated but not for spatially correlated test images. The spatially correlated display created a strong impression of motion in depth when the images were changed on every frame, at a frame
rate of 67 Hz. This percept arises from only the CD signal. A spatially uncorrelated dichoptic display that changed on every frame appeared as a flickering display of dots. Such a display contains neither the CD signal nor the IOVD signal.
Motion-in-depth discrimination thresholds were also obtained. Figure 8.6 shows a typical psychometric function. The black circles and the solid line show the percentage of correct responses and a probit fit for motion-in-depth discrimination versus lifetime. When the spatially uncorrelated display was refreshed in each frame, performance was at the chance level, which corresponds to the loss of the motion signal. Performance increased with increasing dot lifetime: 75% correct performance was achieved with about a 40 ms dot lifetime. The crosses and the dashed line show the psychometric function for binocular discrimination of left–right lateral motion in the same subject with the same displays when both half-images moved in the same direction. It can be seen that 75% correct performance was achieved at approximately the same dot lifetime. To reach a 75% discrimination threshold, our six subjects required dot lifetimes of between two and five frames. Reliable thresholds could not be obtained for a seventh subject. Across the six subjects, the thresholds for discrimination of motion in depth and lateral motion were not significantly different for these spatially uncorrelated dichoptic displays (Table 8.1). On average, a dot lifetime of 52 ms was required for the discrimination of motion in depth. According to Cumming and Parker, this stimulus duration should be too short for the system that detects changing disparity but not too short for the motion detection system. If so, then our results demonstrate the existence of an IOVD signal.

Figure 8.5 Matching efficiency (proportion of matched velocity to test velocity) as a function of dot lifetime for spatially correlated and uncorrelated displays. Representative results from two observers are shown.

Figure 8.6 Lifetime thresholds for a typical observer. The smooth curves show a probit fit to the psychometric functions. The data point for motion in depth at 14.92 ms (probability of 0.475) is not visible at this scale.

8.7
Segregated stimuli
Dynamic disparity requires matchable features in the two eyes’ images. Nonmatching moving half-images should not produce motion in depth. In this experiment, the spatially uncorrelated random-dot display was broken up into strips. The strips in one eye alternated with those in the other eye. Shioiri et al. (2000) observed motion in depth in a spatially uncorrelated display in which the left- and right-eye images were segregated into thin alternating horizontal bands. The display was presented for only 120 ms to prevent subjects fusing the stripes by vertical vergence. Shioiri et al. concluded that this effect was due to interocular differences in image motion. However, there would also be
shear disparity signals along the boundaries, which may have created motion in depth. We had found that motion in depth was not obtained in such displays when two alternating bands were separated by horizontal lines (Howard et al., 1998). The horizontal lines acted as a vergence lock as well as a separator, so that subjects could observe the display for longer periods. We conducted an experiment in order to try to determine the conditions necessary to perceive depth in vertically segregated displays.

Table 8.1. Dot lifetime thresholds for discriminating the direction of lateral motion and motion in depth. The threshold was the lifetime required to obtain 75% correct discrimination performance, as estimated from probit fits to the psychometric functions.

Subject          Motion in depth (ms)    Lateral motion (ms)
1                36.1                    37.5
2                67.0                    54.3
3                45.5                    39.9
4                48.7                    19.7
5                38.4                    42.4
6                76.4                    52.8
Mean ± s.e.m.    52.0 ± 5.8              41.1 ± 5.1

8.7.1
Methods
The basic stimulus was as in the earlier experiments except for the addition of horizontal lines intended to assist in the maintenance of vertical vergence (Figure 8.7). The test image was divided into strips of dots, which alternated between the eyes. The strips were either abutting or separated by a dichoptic horizontal line. The lines provided a lock for vertical vergence so that subjects could observe the display for long periods without fusion of the left- and right-eye images. The lines also separated the moving dots so that spurious shear-disparity signals would be weakened or absent. The strip width was varied between 4 and 40 pixels per strip (Figure 8.8). 8.7.2
Results and discussion
Figure 8.9 shows the matched velocity as a proportion of the stimulus velocity, as a function of the strip width. As the strip width was increased, the perceived motion in depth deteriorated and switched to a percept of simple
shearing of strips of dots against each other. Motion in depth in these displays could arise from “direct” registration of dynamic disparity from spurious pairings of dots along the abutting edges. This spurious signal would be strengthened as the strip size decreased and the number of abutting edges increased. There was also a weak indication that motion in depth was stronger if the strips abutted rather than being separated by lines. The lines separating the strips retained the relative motion between the strips but weakened or eliminated spurious disparity signals. The fact that motion in depth still occurred with strips separated by lines demonstrates that spurious matches along the borders cannot fully explain the percept of depth in these displays (however, there is considerable tolerance for vertical disparity in human stereopsis; see Fukuda et al., 2009).

Figure 8.7 Stimulus for study of effects of IOVD in segregated uncorrelated RDSs. The displays were similar to those in Figure 8.2 except that the test display was segregated into horizontal bands of exclusively left- or right-eye dots.

Figure 8.8 Schematic illustration of the vertically segregated display. Right- and left-eye dots were presented in alternating strips as shown, either abutting or separated by dichoptic lines (shown as stippled lines in the figure).
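For concreteness, the following sketch (ours, with assumed raster dimensions and dot density) builds the kind of vertically segregated half-images shown in Figures 8.7 and 8.8: alternating horizontal strips contain exclusively left-eye or exclusively right-eye dots, and the dichoptic separator lines that served as a vergence lock can optionally be drawn at the strip boundaries.

    import numpy as np

    rng = np.random.default_rng(2)

    def segregated_half_images(width, height, strip_height, density,
                               separators=False):
        """Return boolean dot rasters (left, right) in which alternating
        horizontal strips contain exclusively left-eye or right-eye dots."""
        dots = rng.random((height, width)) < density    # candidate dot raster
        rows = np.arange(height)
        strip_index = rows // strip_height
        left_rows = (strip_index % 2 == 0)              # even strips -> left eye
        right_rows = ~left_rows                         # odd strips  -> right eye

        left = dots & left_rows[:, None]
        right = dots & right_rows[:, None]

        if separators:
            # A horizontal line at each strip boundary, drawn in both eyes
            # (a dichoptic vergence lock that also separates the moving dots).
            boundary = (rows % strip_height == 0)
            left[boundary, :] = True
            right[boundary, :] = True
        return left, right

    left, right = segregated_half_images(width=640, height=120,
                                         strip_height=8, density=0.015,
                                         separators=True)
    print(left.shape, right.shape)
    # Opposite horizontal motion would then be applied to the two rasters
    # (e.g., np.roll(left, +1, axis=1) and np.roll(right, -1, axis=1))
    # to impose the interocular velocity difference.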
Figure 8.9 Segregation results (N = 8). The matching efficiency for displays where the left- and right-eye images were vertically segregated into strips is shown as a function of strip width for both abutting strips and strips separated by horizontal lines.
8.8
General discussion
The binocular images of an approaching object move in opposite directions. The relative image velocity is determined by the approach velocity, and the ratio of the velocities in the two eyes is determined by the point of impact within the observer’s frontal plane (Regan, 1993; Portfors-Yeomans and Regan, 1996). The binocular signals can differ in the order in which information is processed. For the CD signal, the disparity at each instant is coded, and changes in disparity code motion in depth. In the IOVD signal, the motion of each image is detected, and the differences code motion. Alternatively, motion in depth could be generated by specialized detectors sensitive to changing disparity in the absence of instantaneous disparity signals. In a DRDS, images are correlated spatially but not over time. There is no coherent motion of monocular images in such a display, so that motion in depth must be detected by a pure CD signal. This signal involves detection of moment-to-moment disparity and then extraction of CD by a process of differentiation. As is well known, DRDS displays do not eliminate motion energy but make it incoherent. Thus, as Harris et al. (2008) pointed out, IOVD signals are still present in these displays but do not signal consistent motion in depth.
We found that a sensation of motion in depth can be created by spatially uncorrelated but temporally correlated dichoptic images moving in opposite directions. Such a display has no change in mean disparity but produces a consistent IOVD signal. The effects of variation of dot density, dot lifetime, stimulus velocity, and oscillation frequency were studied. All subjects perceived strong
motion in depth in the uncorrelated display with many variants of the basic stimulus. No consistent impression of depth was obtained when the motion was stopped. Thus, dynamic depth can be created by a changing disparity in a display with zero mean instantaneous disparity.
To control for effects of vergence eye movements, we used a display where dichoptic random-dot fields above and below a fixation point were given opposite IOVDs (so that one field appeared to recede as the other approached). In a recent review, Harris et al. (2008) noted that other studies have also used such a configuration. They wondered whether two opposing IOVD signals were necessary to produce a robust perception of motion in depth with spatially uncorrelated RDSs. However, we produced strong motion in depth in a single spatially uncorrelated RDS with respect to a stationary dot.
Two mechanisms could produce motion in depth in a spatially uncorrelated display. First, the effect could depend on a pure IOVD signal derived from an initial registration of motion in each image. Second, the effect could depend on a mechanism sensitive to a consistent sign of changing disparity between randomly paired dots in the absence of coherent instantaneous disparity signals. This implies persistence in the binocular matching process. In either case, the results imply sensitivity to changing disparity without reliance on a consistent static disparity signal.
8.8.1
Comparison of monocular motion signals
Several lines of evidence favor the existence of a true IOVD mechanism based on monocular motion signals (for recent reviews, see Harris et al., 2008; Regan and Gray, 2009).

8.8.1.1 Isolation of IOVD cues

Like Shioiri et al. (2000), we have shown that motion in depth can be produced by dichoptic motion in stimuli that are not binocularly superimposed. It is difficult to draw firm conclusions from experiments in which the right- and left-eye images are vertically segregated. Motion in depth in natural scenes involves relative motion of binocularly superimposed images, except in cases of monocular occlusion (Brooks and Gillam, 2007). Motion in depth with segregated images may arise from direct registration of changing disparity in spurious matches along the adjacent image bands. However, our results for bands separated by horizontal lines suggest that this may not be the whole story. Similarly, temporal segregation can be used to stimulate motion-sensitive mechanisms in each eye without presenting disparity.
We attempted to isolate the effects of IOVD by use of negative motion aftereffects (MAEs). The logic was that if oppositely directed monocular MAEs were induced in each eye, they
would combine during subsequent viewing of a stationary binocular stimulus to produce a negative IOVD aftereffect. Initially, we could not get a robust negative MAE in depth, despite robust monocular MAE. Rather than using negative aftereffects, Brooks (2002a) showed that simultaneous or sequential adaptation of each eye to a moving random-dot display reduced the perceived velocity of motion in depth produced by spatially uncorrelated images. This appears to be evidence for a pure IOVD mechanism. However, in the critical condition in these experiments, periods of left-eye and right-eye adaptation were alternated temporally. Although left- and right-eye images were not presented simultaneously, it is well established that the visual system integrates disparity signals over time. It is thus possible that CD mechanisms adapted during this period. Even so, one would expect a reduction in the aftereffect for sequential compared with simultaneous adaptation, but this was not the case.
Later, we obtained a weak but reliable negative MAE in depth following adaptation to correlated or uncorrelated random-element stereograms moving in depth (Sakano and Allison, 2007; Sakano et al., 2005; Sakano et al., 2006). However, we could not obtain negative aftereffects of motion in depth from DRDS stimuli when effects of disparity adaptation were controlled. This finding suggests that pure CD mechanisms do not adapt significantly for constant-velocity stimuli. However, Regan et al. (1998) reported decreased sensitivity to motion-in-depth oscillations following adaptation to DRDS oscillations, so it is possible that CD adaptation could have produced some MAE in the Brooks experiments. Further support for a role of IOVD in the perception of motion in depth comes from demonstrations of effects on perceived three-dimensional trajectories of motion in depth following adaptation to monocular motion (Brooks, 2002b; Shioiri et al., 2003). Thus, overall, adaptation evidence suggests the existence of a true IOVD mechanism not dependent on binocular matching, although these effects appear variable and relatively weak.
8.8.1.2 IOVD enhancement of motion in depth

If IOVD mechanisms contribute to the perception of motion in depth, then one might expect improved performance with stimuli containing both IOVD and CD compared with those containing only CD (see the discussion of the results of Cumming and Parker (1994) and Regan (1993) in Section 8.2). Brooks and Stone (2004) found that the motion-in-depth speed-discrimination threshold for a DRDS, which contained only the CD signal, was 1.7 times higher than the threshold for a regular RDS, which contained both CD and IOVD signals. Thus, they concluded that the IOVD signal supplements the CD signal. However, Portfors-Yeomans and Regan (1996) found that motion-in-depth speed discrimination for RDS stimuli with both CD and IOVD signals was no better than that
for DRDS stimuli with only CD. Also, Gray and Regan (1996) found no advantage for detection of motion in depth. Regan and Gray (2009) noted that these null results do not rule out a small contribution of IOVD. They also suggested that the contribution of IOVD mechanisms may differ for suprathreshold speed discrimination compared with motion-in-depth detection (see also Harris and Watamaniuk, 1995).

8.8.1.3 Comparison of stereomotion and lateral motion

Another line of evidence for the existence of IOVD mechanisms arises from comparison of the properties of stereomotion with those of static stereopsis and lateral motion. Such comparisons have been made for judgments including search (Harris et al., 1998), detection (Gray and Regan, 1996), apparent magnitude (Brooks and Stone, 2006b), and speed or direction discrimination (Brooks and Stone, 2004; Fernandez and Farell, 2005; Harris and Watamaniuk, 1995; Portfors-Yeomans and Regan, 1997), as well as for effects of eccentricity (Brooks and Mather, 2000), spatial and/or temporal frequency (Lages et al., 2003; Shioiri et al., 2009), direction (Brooks and Stone, 2006b), velocity and scale (Brooks and Stone, 2006a), contrast (Blakemore and Snowden, 1999; Brooks, 2001), and other stimulus parameters. The logic is that if IOVD mechanisms exist, they should reflect the properties of motion mechanisms rather than of static disparity mechanisms. For instance, lateral-motion discrimination degrades with increasing retinal eccentricity. Therefore, if motion in depth is coded by IOVD, then motion-in-depth discrimination should also deteriorate with increasing eccentricity. Brooks and Mather (2000) found that speed discrimination for lateral motion and for motion in depth were affected in the same way when the stimulus was moved 4° from the fovea. Discrimination of stationary depth intervals was not affected. Brooks and Mather concluded that IOVD signals are involved in the detection of motion in depth (see Harris et al. (2008) and Regan and Gray (2009) for recent reviews of studies comparing lateral motion with motion in depth). These types of conclusions are predicated on the idea that if motion-in-depth processing resembles lateral-motion processing more than static-disparity processing, then the motion-in-depth mechanisms are most likely based on monocular motion processing. The processing of changing disparity is subject to many of the same ecological considerations as lateral-motion processing, and it is conceivable that it could develop properties distinct from static-disparity processing but similar to lateral-motion processing.

8.8.1.4 Selective deficits or limitations

The case for a specialized mechanism for motion in depth is most convincing when stereomotion is possible with stimuli that do not support static
stereopsis. In this chapter, we reported that the impression of motion in depth required a dot lifetime of only about 50 ms. Cumming and Parker have argued that this is below the temporal limits of disparity processing but not too short for motion processing. If their logic is correct, then this result supports the pure IOVD hypothesis. Similarly, selective deficits either in stereomotion perception or in static stereopsis provide evidence that a distinct functional mechanism exists (Richards and Regan, 1973). Watanabe et al. (2008) found several strabismic subjects who were sensitive to stereomotion but not to static stereopsis, although a role for eye movements cannot be ruled out, as fixation was not controlled.

8.8.1.5 Summary

Robust motion-in-depth perception with spatially uncorrelated displays, as reported in this chapter, provides strong evidence for mechanisms sensitive to dynamic disparity. Similarly, selective deficits or limitations in static stereopsis, in the face of preserved stereomotion perception, provide evidence for dynamic disparity mechanisms that do not rely on static disparity processing. However, neither line of evidence requires a true IOVD mechanism as in Figure 8.1. The similarity of lateral-motion and stereomotion perception may reflect either an underlying common substrate or similar ecological constraints. The influence of spatially or temporally segregated monocular motion signals on stereomotion perception suggests the existence of a true IOVD mechanism not dependent on binocular matching. However, these effects are variable and relatively weak, and it is difficult to truly segregate the inputs.
8.8.2
Dynamic disparity
Normally, IOVD cues to motion in depth are associated with binocularly paired elements. While the evidence reviewed above suggests that unmatched, opposite monocular motion signals may produce motion in depth, an IOVD in a binocularly matched element should produce stronger evidence of motion in depth. We propose that the robust impression of motion in depth in uncorrelated RDSs with IOVD might arise more from the consistent sign of the changing disparity between randomly paired dots – a dynamic disparity signal. If the matching of dots at one instant is influenced by previous matches, then this should provide a coherent motion-in-depth signal from changing disparity. The perception of coherent motion from incoherent disparity signals suggests that binocular matches persist over time and that the changing disparity of a matched feature can be tracked. Thus, even though an uncorrelated RDS
appears as an incoherent volume of random depths, each consistently matched pair undergoes a coherent change in disparity. Mechanisms or channels that are selective for both binocular correspondence and interocular velocity difference (and perhaps oscillation frequency or the rate of change of disparity) could detect dynamic disparity. Such mechanisms would still rely on binocular pairing of dots but may differ from the mechanisms that process static stereopsis. These mechanisms would have evolved under many of the same constraints as lateral-motion perception, which may explain why lateral-motion and motion-in-depth perception share many common properties. These common properties may also reflect a common neurophysiological substrate without the need for postulating pure CD or IOVD mechanisms.
A disparity detector with appropriate spatiotemporal tuning could be sensitive to a preferred change in disparity over time. This is consistent with the known physiology of disparity detection in early vision. Disparity-sensitive cells in V1, V2, and MT are jointly sensitive to disparity and motion. This spatiotemporal filtering could provide a substrate for processing changing binocular disparity without an explicit IOVD or CD mechanism. However, to our knowledge, the required differences in interocular tuning for motion in cells sensitive to disparity have not been reported.
It is also possible, perhaps likely, that dynamic disparity detectors rely on coarsely matched features. There is compelling evidence that a coarse, transient mechanism guides vergence eye movements and enables transient stereopsis (for a recent review, see Wilcox and Allison, 2009). This mechanism does not require precise matching of binocular features. Stereomotion analogues of this transient mechanism could produce motion in depth from spatially or temporally segregated stimuli. Such mechanisms would permit short-latency responses to rapidly moving objects. On the other hand, sustained stereoscopic mechanisms rely on precise binocular matching. Stereomotion analogues of these mechanisms would serve the perception and tracking of stereoscopically matched, slowly moving objects. Such sustained and transient stereomotion mechanisms would have complementary functions. Such a functional dichotomy may also help explain apparently conflicting results obtained with different tasks and with suprathreshold versus near-threshold motion-in-depth stimuli (Harris and Watamaniuk, 1995; Regan and Gray, 2009).

Acknowledgments

The support of the Natural Sciences and Engineering Research Council of Canada is greatly appreciated.
References
Allison, R. S. and Howard, I. P. (2000). Stereopsis with persisting and dynamic textures. Vis. Res., 40: 3823–3827.
Allison, R. S., Howard, I. P., and Howard, A. (1998). Motion in depth can be elicited by dichoptically uncorrelated textures. Perception, 27: ECVP Abstract Supplement.
Blakemore, M. and Snowden, R. (1999). The effect of contrast upon perceived speed: a general phenomenon? Perception, 28: 33–48.
Brooks, K. R. (2001). Stereomotion speed perception is contrast dependent. Perception, 30: 725–731.
Brooks, K. R. (2002a). Interocular velocity difference contributes to stereomotion speed perception. J. Vis., 2: 218–231.
Brooks, K. R. (2002b). Monocular motion adaptation affects the perceived trajectory of stereomotion. J. Exp. Psychol. Hum. Percept. Perf., 28: 1470–1482.
Brooks, K. R. and Gillam, B. J. (2007). Stereomotion perception for a monocularly camouflaged stimulus. J. Vis., 7: 1–14.
Brooks, K. R. and Mather, G. (2000). Perceived speed of motion in depth is reduced in the periphery. Vis. Res., 40: 3507–3516.
Brooks, K. R. and Stone, L. S. (2004). Stereomotion speed perception: contributions from both changing disparity and interocular velocity difference over a range of relative disparities. J. Vis., 4: 1061–1079.
Brooks, K. R. and Stone, L. S. (2006a). Spatial scale of stereomotion speed processing. J. Vis., 6: 1257–1266.
Brooks, K. R. and Stone, L. S. (2006b). Stereomotion suppression and the perception of speed: accuracy and precision as a function of 3D trajectory. J. Vis., 6: 1214–1223.
Cumming, B. G. and Parker, A. J. (1994). Binocular mechanisms for detecting motion-in-depth. Vis. Res., 34: 483–495.
Cumming, B. G., Shapiro, S. E., and Parker, A. J. (1998). Disparity detection in anticorrelated stereograms. Perception, 27: 1367–1377.
Diner, D. B. and Fender, D. H. (1987). Hysteresis in human binocular fusion: temporalward and nasalward ranges. J. Opt. Soc. Am. A, 4: 1814–1819.
Fender, D. and Julesz, B. (1967). Extension of Panum's fusional area in binocularly stabilized vision. J. Opt. Soc. Am., 57: 819–830.
Fernandez, J. M. and Farell, B. (2005). Seeing motion in depth using inter-ocular velocity differences. Vis. Res., 45: 2786–2798.
Fukuda, K., Wilcox, L. M., Allison, R. S., and Howard, I. P. (2009). A reevaluation of the tolerance to vertical misalignment in stereopsis. J. Vis., 9(2): 1.
Gray, R. and Regan, D. (1996). Cyclopean motion perception produced by oscillations of size, disparity and location. Vis. Res., 36: 655–665.
Halpern, D. L. (1991). Stereopsis from motion-defined contours. Vis. Res., 31: 1611–1617.
Harris, J. M. and Watamaniuk, S. N. J. (1995). Speed discrimination of motion-in-depth using binocular cues. Vis. Res., 35: 885–896.
Harris, J. M., McKee, S. P., and Watamaniuk, S. N. (1998). Visual search for motion-in-depth: stereomotion does not "pop out" from disparity noise. Nature Neurosci., 1: 165–168.
Harris, J. M., Nefs, H. T., and Grafton, C. E. (2008). Binocular vision and motion-in-depth. Spat. Vis., 21: 531–547.
Helmholtz, H. von (1909). Physiological Optics. English translation 1962 by J. P. C. Southall from the 3rd German edition of Handbuch der Physiologischen Optik. New York: Dover.
Howard, I. P., Allison, R. S., and Howard, A. (1998). Depth from moving uncorrelated random-dot displays. Invest. Ophthalmol. Vis. Sci., 39: S669.
Julesz, B. (1971). Foundations of Cyclopean Perception. Chicago, IL: University of Chicago Press.
Kaufman, L. and Pitblado, C. B. (1969). Stereopsis with opposite contrast conditions. Percept. Psychophys., 6: 10–12.
Lages, M., Mamassian, P., and Graf, E. W. (2003). Spatial and temporal tuning of motion in depth. Vis. Res., 43: 2861–2873.
Lee, D. N. (1970). Binocular stereopsis without spatial disparity. Percept. Psychophys., 9: 216–218.
Pong, T. C., Kenner, M. A., and Otis, J. (1990). Stereo and motion cues in preattentive vision processing – some experiments with random-dot stereographic image sequences. Perception, 19: 161–170.
Portfors-Yeomans, C. V. and Regan, D. (1996). Cyclopean discrimination thresholds for the direction and speed of motion in depth. Vis. Res., 36: 3265–3279.
Portfors-Yeomans, C. V. and Regan, D. (1997). Discrimination of the direction and speed of motion in depth of a monocularly visible target from binocular information alone. J. Exp. Psychol.: Hum. Percept. Perf., 23: 227–243.
Regan, D. (1993). Binocular correlates of the direction of motion in depth. Vis. Res., 33: 2359–2360.
Regan, D. and Gray, R. (2009). Binocular processing of motion: some unresolved questions. Spat. Vis., 22: 1–43.
Regan, D., Gray, R., Portfors, C. V., Hamstra, S. J., Vincent, A., Hong, X., and Kohly, R. (1998). Catching, hitting and collision avoidance. In L. R. Harris and M. Jenkin (eds.), Vision and Action, pp. 171–209. Cambridge: Cambridge University Press.
Richards, W. and Regan, D. (1973). A stereo field map with implications for disparity processing. Invest. Ophthalmol. Vis. Sci., 12: 904–909.
Rogers, B. J. (1987). Motion disparity and structure-from-motion disparity. [Abstract.] Invest. Ophthalmol. Vis. Sci., 28: 233.
Sakano, Y. and Allison, R. S. (2007). Adaptation to apparent motion-in-depth based on binocular cues. Vision, 19(1): 58. (Winter Conference of The Vision Society of Japan, Tokyo.)
Sakano, Y., Allison, R. S., and Howard, I. P. (2005). Aftereffects of motion in depth based on binocular cues. J. Vis., 5: 732–732.
Sakano, Y., Allison, R. S., Howard, I. P., and Sadr, S. (2006). Aftereffect of motion-in-depth based on binocular cues: no effect of relative disparity between adaptation and test surfaces. J. Vis., 6: 626–626.
Shioiri, S., Saisho, H., and Yaguchi, H. (1998). Motion in depth from interocular velocity differences. Invest. Ophthalmol. Vis. Sci., 39: S1084.
Shioiri, S., Saisho, H., and Yaguchi, H. (2000). Motion in depth based on inter-ocular velocity differences. Vis. Res., 40: 2565–2572.
Shioiri, S., Kakehi, D., Tashiro, T., and Yaguchi, H. (2003). Investigating perception of motion in depth using monocular motion aftereffect. J. Vis., 3: 856–856.
Shioiri, S., Kakehi, D., Tashiro, T., and Yaguchi, H. (2009). Integration of monocular motion signals and the analysis of interocular velocity differences for the perception of motion-in-depth. J. Vis., 9: 1–17.
Watanabe, Y., Kezuka, T., Harasawa, K., Usui, M., Yaguchi, H., and Shioiri, S. (2008). A new method for assessing motion-in-depth perception in strabismic patients. Br. J. Ophthalmol., 92: 47–50.
Wilcox, L. M. and Allison, R. S. (2009). Coarse–fine dichotomies in human stereopsis. Vis. Res., 49: 2653–2665.
9 Representation of 3D action space during eye and body motion
W. Pieter Medendorp and Stan Van Pelt
9.1 Introduction
Perhaps the most characteristic aspect of life, and a powerful engine driving adaptation and evolution, is the ability of organisms to interact with the world by responding adequately to sensory signals. In the animal kingdom, the development of a neural system that processes sensory stimuli, learns from them, and acts upon them has proven to be a major evolutionary advantage in the struggle for existence. It has allowed organisms to flee danger, actively search for food, and inhabit new niches and habitats at a much faster pace than ever before in evolutionary history. The more complex animals became, the more extensive and specialized became their nervous system (Randall et al., 1997). Whereas some simple invertebrates such as echinoderms lack a centralized brain and have only a ring of interconnected neurons to relay sensory signals, vertebrates such as mammals have developed a highly specialized neural network, consisting of a central and a peripheral nervous system, in which each subunit has its own functional properties in controlling the body. While the spinal cord and brainstem are involved in controlling automated, internal vegetative processes such as heartbeat, respiration, and reflexes, the prosencephalon (the forebrain, containing the neocortex) has specialized in so-called higher-order functions, such as perception, action, learning, memory, emotion, and cognition (Kandel et al., 2000). The specialization of the neural control of movement is a major feature that distinguishes primates from other animals. This has led to highly specialized
and distinctive capabilities such as reaching, grasping, tool use, and ocular foveation.

9.2 Sensorimotor transformations
Despite the apparent automation and effortlessness of goal-directed behavior such as reaching and grasping, primate brains are presented with complex computational challenges that have to be solved to enable correct execution of such movements. For example, imagine you want to pick up a cup of espresso while reading a newspaper, as illustrated in Figure 9.1. To perform the appropriate grasping movement, you have to convert the visual information that your eyes provide about the cup’s location into specific muscle contractions that move your hand to the espresso. How this process, called a sensorimotor transformation, comes about is one of the major research topics in neuroscience. To understand the brain’s way of executing these transformations, neuroscientists often utilize the concepts of coordinate systems and reference frames, such as eye-centered, head-centered, body-centered, and Earth-centric coordinate frames (Soechting and Flanders, 1992). In a mathematical sense, a frame of reference is a set of rigid axes that are usually perpendicular to each other and intersect
Figure 9.1 Sensorimotor control in an everyday situation: picking up a cup of espresso while reading a newspaper. The cup’s sensory location (in the right visual hemifield) needs to be transformed into the proper arm motor command (“move left”), which requires a reference frame transformation.
at one point, the origin. These axes allow the spatial location of an object to be defined by a set of coordinates. This concept is used in a wide variety of applications, including applications outside of neuroscience. For example, Dutch GPS receivers (such as in car navigation systems) will provide you with your location according to the Dutch Rijksdriehoeksmeting (or RD grid), in coordinates relative to the Onze Lieve Vrouwe church tower in Amersfoort, in Cartesian (metric) units. However, internationally, the WGS84 grid is the GPS standard, telling you where you are relative to the Greenwich meridian, in degrees longitude and latitude. So, one single location can be described by different coordinates, depending on the reference frame and metric used to code the position. The concept of reference frames can also be applied to spatial representations in the brain. For example, all visual information that falls on the retinas of our eyes is expressed in coordinates relative to the fovea, and thus in an eye-based, gaze-centered (also called retinal) frame of reference. In contrast, when making a reaching movement, we have to reposition our hands and arms relative to our body. So the required motor commands specify coordinates in a body-centered, joint-based reference frame. In our example, the espresso cup is located in the right visual hemifield (so the retinal input coordinates are "to the right"), but it is to the left of the right hand (so the output command should be "move to the left"). Using these terms, we can understand sensorimotor transformations by describing how the brain translates the position of the cup from the coordinates of the retinas into the coordinates of a reference frame linked to the body.
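To make the bookkeeping concrete, the short Python sketch below chains an eye-to-head and a head-to-body rotation and then expresses the reach command relative to the hand. It is only an illustration of the transformation described above: the angles, offsets, and the helper function rotate are hypothetical and are not taken from any specific model or dataset.

```python
import numpy as np

def rotate(v, angle_deg):
    """Rotate a 2D vector counterclockwise by angle_deg degrees."""
    a = np.radians(angle_deg)
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]]) @ v

# Hypothetical top-down 2D example; positions in cm, angles in deg
# (counterclockwise positive). All values are invented for illustration.
target_in_eye = np.array([20.0, 40.0])      # cup location in eye (retinal) coordinates
eye_in_head_deg = 10.0                      # orientation of the eye in the head
head_in_body_deg = -5.0                     # orientation of the head on the trunk
eye_offset_in_body = np.array([3.0, 15.0])  # position of the eye relative to the shoulder
hand_in_body = np.array([30.0, 10.0])       # current hand position relative to the shoulder

# Chain of reference-frame transformations: eye -> head -> body.
target_in_head = rotate(target_in_eye, eye_in_head_deg)
target_in_body = rotate(target_in_head, head_in_body_deg) + eye_offset_in_body

# The motor command is expressed relative to the effector, not the retina.
reach_vector = target_in_body - hand_in_body
print(reach_vector)
```

The same chain, run in reverse, is what is usually meant by transforming a body-centered or allocentric representation back into retinal coordinates.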
9.3 Spatial constancy in motor control
It is essential for survival to be able to rely on a veridical representation of object locations in the outside world, for example for a predator to be able to catch its prey or for humans to handle tools. The process of keeping track of the locations of objects around us is referred to as spatial constancy (Helmholtz, 1867; Von Holst and Mittelstaedt, 1950) and is an important prerequisite for goal-directed behavior. While maintaining spatial constancy seems relatively easy as long as objects are visible, it is essential that it can also be relied on in the absence of current spatial input, since we do not always immediately initiate actions upon objects that we see. In some instances, a movement to a previously specified location must be delayed; for example, in the case of Figure 9.1, one might see the espresso but not reach for it, because it is still too hot. In this case, the location of the cup must be stored in memory, in so-called spatial memory, which can be used to guide a later movement. If the brain were to store this information in retinal coordinates, as “to the right,” then these memorized coordinates would immediately become obsolete when the
eyes look in another direction. In order to maintain spatial constancy, target locations that are important for programming future actions must either be stored in a form that is independent of eye movements or be internally updated to compensate for the eye movements (Crawford et al., 2004). In this chapter, we will review some recent insights into the signals and mechanisms by which the brain achieves spatial constancy in motor control. We thus leave undiscussed the seminal work that has been performed in relation to spatial updating for perceptual tasks (see Melcher and Colby (2008) for review).

9.4 Quality of spatial constancy
Over recent decades, a large number of psychophysical studies have shown that we can maintain spatial constancy fairly well. The first to systematically investigate this issue were Hallett and Lightstone (1976), using the now classic double-step saccade task. Subjects were briefly presented with two targets in the visual periphery and subsequently asked to make saccadic eye movements to both locations in sequence. In this task, the first saccade dissociates the retinal location of the second target from the goal location of the second saccade. Thus, after the first saccade, subjects have to recompute or update the intended amplitude and direction of the second saccade to make it reach the goal. Hallett and Lightstone showed that subjects could indeed perform the double-step task correctly, which was taken as evidence that spatial constancy can be maintained across saccadic eye movements. These findings were subsequently confirmed in a series of monkey neurophysiological experiments by Mays and Sparks (1980). Since then, many other investigators have replicated these results. In more recent years, a large number of other behavioral experiments have elaborated on these results by testing spatial constancy in different task conditions. By now, it has been shown that subjects can also keep track of memorized targets across smooth pursuit (McKenzie and Lisberger, 1986; Blohm et al., 2003; Baker et al., 2003) and vergence eye movements (Krommenhoek and Van Gisbergen, 1994). Also, head movements (Guitton, 1992; Medendorp et al., 2002; Vliegen et al., 2005) and whole-body movements (Medendorp et al., 2003b; Baker et al., 2003; Li et al., 2005; Li and Angelaki, 2005; Klier et al., 2005, 2008) do not disrupt spatial constancy to a great extent.
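The updating required in the double-step task can be written as a simple vector operation: the stored retinal vector of the second target minus the displacement produced by the first saccade. The sketch below is a minimal, hypothetical illustration of that arithmetic (the numbers are chosen arbitrarily), not a claim about how the brain implements it.

```python
import numpy as np

# Retinal locations of the two flashed targets (deg of visual angle),
# measured while the eyes are still at the initial fixation point.
t1_retinal = np.array([10.0, 0.0])   # first target: 10 deg to the right
t2_retinal = np.array([10.0, 8.0])   # second target: right and up

# The first saccade lands on T1, so the eyes are displaced by t1_retinal.
saccade1 = t1_retinal

# A saccade programmed from the stored retinal vector alone would miss,
# because the eyes are no longer where they were when T2 was flashed.
# Updating subtracts the intervening eye displacement from the stored vector.
saccade2_updated = t2_retinal - saccade1
print(saccade2_updated)   # [0., 8.]: a purely upward saccade now lands on T2
```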
9.5 Reference frames for spatial constancy
Theoretically, there are many ways to achieve spatial constancy for motor control, depending on the underlying reference frame and metric involved. One possibility is that spatial constancy is preserved in a body-centered
reference frame. Spatial representations in body-centered coordinates are quite stable because they are not influenced by intermediate eye or head movements, but only by displacements of the body (see Battaglia-Mayer et al. (2003) for a review). Alternatively, the system could use a retinal frame of reference to code target locations. Such an eye-centered target representation (here referred to synonymously as a retinal, retinotopic, gaze-centered, or gaze-dependent representation) needs to be updated every single time the eyes move to maintain spatial constancy. For example, if the person in Figure 9.1 redirects her gaze towards her watch, the retinal location of the espresso cup shifts from the right to the left visual hemifield. So, internally, the spatial representation of the espresso's position has to be updated accordingly from right to left to maintain spatial constancy in gaze-centered coordinates. Another possibility is that the spatial reference lies outside the body. For example, locations can be coded relative to other objects (e.g., relative to the table in Figure 9.1) or relative to gravity. These so-called allocentric representations do not require any updating of the object position for eye, head, or body movements, so spatial constancy is always assured. Of course, allocentric representations need to be transformed into egocentric signals when one is guiding movements. Which reference frame is the most advantageous in everyday behavior such as grasping and looking at objects is not directly obvious, and may be different for different actions. For example, in the absence of allocentric cues, a body-centered representation seems optimal for reaching and pointing, because of the benefit of stability. But, notably, there are also arguments in favor of an eye-centered coding that is shared across effectors. First, the visual system is the dominant sensory system for spatial input and many brain regions are involved in visual processing. This would make it computationally and energetically beneficial to use a retinal frame of reference as much as possible. A second reason is related to the difference in spatial resolution of these coordinate frames. The transformation of retinotopic into body-centered coordinates might degrade the resolution of the original input. Another argument for an eye-centered frame may be that it is simpler to orchestrate multiple effectors when they move to the same target (Cohen and Andersen, 2002). Finally, there are also (theoretical) studies suggesting that the brain uses multiple reference frames simultaneously in building spatial representations, which are combined to support behavior depending on the task (Pouget et al., 2002). We will take up this topic again in a later section, but first we turn to considering some of the neural signatures of the mechanisms subserving spatial constancy across saccades.
9.6 Neural mechanisms for spatial constancy across saccades
The posterior parietal cortex (PPC) has been shown to play an important role in maintaining spatial constancy and processing sensorimotor transformations. In the primate brain, the PPC contains specialized subunits that process spatial information for different kinds of movements, such as saccades, reaching, and grasping (see Colby and Goldberg (1999), Andersen and Buneo (2002), Culham and Valyear (2006), and Jackson and Husain (2006) for reviews). Figure 9.2a, left, displays a top view of a rendered representation of the left hemisphere of the macaque cortex, in which several of these parietal subregions are highlighted. The putative human homologues of these monkey areas are depicted in Figure 9.2a, right, in a similar view. In both species, specialized areas are located within and around the intraparietal sulcus (IPS). For example, the anterior intraparietal area (AIP) of the monkey and its human functional equivalent are active during grasping, while the lateral intraparietal area (LIP) is involved in representing target locations of saccades. The medial intraparietal area (MIP) and extrastriate visual area V6A constitute the parietal reach region (PRR), which codes the targets of reaching movements.
Figure 9.2 (a) Parietal sensorimotor areas. Left hemispheres of the macaque and human cortex, respectively. The color-coded regions, with their corresponding names, indicate regions along the intraparietal sulcus (IPS) that are involved in spatial processing and motor preparation. CS, central sulcus; AIP, anterior intraparietal area; LIP, lateral intraparietal area; MIP, medial intraparietal area; V6A, visual area 6A; VIP, ventral intraparietal area. Adapted from Culham and Valyear (2006), with permission. (b) Remapping of visual activity in monkey area LIP. Left panel: a neuron responds when a saccade brings the eye-centered receptive field onto the location of a visual stimulus. Right panel: the same neuron responds when a saccade moves the receptive field onto a previously stimulated location, indicating gaze-centered updating. Modified from Duhamel et al. (1992). A color version of this figure can be found on the publisher's website (www.cambridge.org/9781107001756).
Over recent decades, monkey neurophysiological studies have revealed that parietal regions such as the LIP and PRR encode and store target locations in gaze-centered maps. It has also been shown that activity is shifted in these gaze-centered maps to compensate for eye movements in order to maintain the correct coding of the target in retinal coordinates. A landmark study of the neural processes involved in spatial updating was performed by Duhamel et al. (1992), recording from macaque LIP neurons (Figure 9.2b). LIP neurons have retinotopic receptive fields, which means that the cell fires whenever a stimulus is presented at a specific location that is fixed with respect to the fovea. Furthermore, in many LIP neurons the neuronal activity is sustained even after the visual stimulus has disappeared, which can be regarded as a neural correlate of visuospatial working memory. Importantly, neural responses also occur when a saccade brings a stimulus into the receptive field (see Figure 9.2b, left panel). This is a marker of gaze-centered updating in LIP. Interestingly, Duhamel et al. showed that this updating also occurs when the saccade is made after an offset stimulus (Figure 9.2b, right panel), which shows that the memory trace is updated. In fact, many neurons in the LIP were found to update their activity even before the saccade started (not shown). Such predictive updating implies that these neurons already "know" the size and direction of the upcoming eye movement, allowing maintenance of spatial constancy. This is possible only if these neurons have received an efference copy (also known as a corollary discharge) of the motor command to the eye muscles (Von Holst and Mittelstaedt, 1950; Sperry, 1950). Nakamura and Colby (2002) observed a similar (predictive) updating of memorized targets at earlier stages within the visual hierarchy, in the extrastriate visual areas V2, V3, and V3A. Furthermore, gaze-centered updating processes have been observed in the superior colliculus (SC) (Walker et al., 1995) and the frontal eye fields (FEFs) (Umeno and Goldberg, 1997). In 2008 a pathway for providing efference copies to such regions was identified (SC⇒thalamus⇒FEF; see Sommer and Wurtz, 2008). The targets of reaching movements seem to be coded and updated in gaze-centered coordinates as well (Batista et al., 1999). In Batista et al.'s experiment, monkeys had to reach for remembered targets, while the initial hand and eye positions were systematically varied. Across a population of neurons in the PRR, a scheme that expressed reach targets using gaze-centered coordinates explained the reach-related activity better than a body-centered coding scheme did. Gaze-centered target remapping has also been observed in the human brain, using the coarse time resolution of functional magnetic resonance imaging (fMRI) (Medendorp et al., 2003a, 2005; Merriam et al., 2003, 2007), the millisecond temporal resolution of magnetoencephalography and electroencephalography (Bellebaum and Daum, 2006; Van Der Werf et al., 2008), and transcranial
magnetic stimulation experiments (Chang and Ro, 2007). For example, in the fMRI experiments of Medendorp et al. (2003a), gaze-centered updating of the targets of both eye and arm movements was demonstrated by a dynamic exchange of activity between the two cortical hemispheres when an eye movement brought the target representation into the opposite hemifield. Conversely, malfunctioning of this gaze-centered updating mechanism could explain the deficits that occur in visuomotor updating in patients with optic ataxia (Khan et al., 2005). For clarity, it should be noted that gaze-centered remapping does not contradict the idea that the same brain regions might also be involved in processing spatial attention (Golomb et al., 2008) or in implicitly transforming the gaze-centered signals into other reference frames, using position signals from the eyes or other body parts, expressed in terms of gain fields (Andersen et al., 1990; DeSouza et al., 2000) or muscle proprioception (Wang et al., 2007; Balslev and Miall, 2008), with the ultimate goal of formulating movement commands in effector-centered coordinates. It should also be mentioned that representations in other reference frames are found in other cortical areas such as the VIP (head-centered coding) (Avillac et al., 2005) and the AIP (hand-centered signals) (Grefkes et al., 2002), while hippocampal place cells appear to code for self-position in allocentric coordinates (see Burgess et al. (2002) for a review). While most of these neurophysiological experiments have tested the mechanisms for updating across saccades, it is clear that such movements are only one of the many challenges the brain faces in maintaining spatial constancy. Real-life movements typically involve the motion of many more body parts, such as the head and trunk, and often take place in a visually complex 3D environment. It is known that spatial constancy can be maintained during more complex movements (see Section 9.4), but the underlying spatial computations have not been identified. Will the gaze-centered-updating hypothesis generalize to such more complex situations of eye, head, and body motion? While this question still awaits experimentation at the neural level, recent experiments have been performed that deal with this issue at the behavioral level, which will be discussed next.
9.7 Gaze-centered updating of target depth
Behavioral experiments can examine the computations in spatial memory across a plethora of dynamics such as eye, head, and body translations and rotations, and reaching and grasping movements (e.g., Baker et al., 2003; Li and Angelaki, 2005; Klier et al., 2005, 2007, 2008; Medendorp et al., 2002, 2003b).
A typical approach here is to evaluate the variable or systematic components of the errors that subjects make under specific task constraints. A behavioral experiment that contributed strongly to the current insights into spatial representations for motor actions was performed by Henriques et al. (1998), a few years after the classic experiment by Duhamel et al. described above. In this study, the experimenters investigated the reference frame for pointing across saccades, using the observation that subjects tend to overshoot the direction of (memorized) targets in the visual periphery in their pointing movements (Bock, 1986). Henriques et al. exploited this relationship by having subjects point to targets initially presented briefly at the fovea, but brought into the retinal periphery by an intermediate saccadic eye movement. Interestingly, the observed pointing errors corresponded to the new – updated – retinal target direction, proving that target direction was kept in a dynamic, gaze-centered map that was updated across eye movements. This corresponds to the neurophysiological findings described by Batista et al. (1999), indicating the strength of psychophysical experiments in probing the brain's spatial computation. In terms of action control, Henriques et al. (1998) proposed that gaze-centered coordinates are used to store and update multiple spatial targets, with only the target selected for motor execution being transformed into a body-centered frame ("conversion-on-demand"). While keeping track of the direction of a target may be adequate for directing saccadic or pointing movements, it will be insufficient for retaining spatial stability for 3D movements such as reaching or grasping. For these types of actions, information about the distance (or depth) of the target is also required. It is generally accepted that depth and directional information are processed separately in the brain (Vindras et al., 2005), so it can be assumed that the spatial constancy of the respective dimensions is guaranteed by different processes as well. Little was known about how spatial constancy in depth is preserved until Li and Angelaki (2005) showed that monkeys are able to update target depth after passive translation in depth, a finding which was replicated in humans (Klier et al., 2008). In addition, Krommenhoek and Van Gisbergen (1994) showed that vergence movements are taken into account when looking at remembered targets at different distances. However, none of these studies addressed the nature of the computations involved in coding target depth. We performed a reaching experiment using a paradigm like that used by Henriques et al. (1998) to address this question (Van Pelt and Medendorp, 2008). In our experiment, illustrated in Figure 9.3, we tested between a gaze-dependent and a gaze-independent coding mechanism by measuring errors in human reaching movements to remembered targets (e.g., T in Figure 9.3a) in peripersonal space, briefly presented before a vergence eye movement
Figure 9.3 Depth updating. (a) After a target T is viewed with binocular fixation at FP1 (18° vergence), a vergence shift is made to FP2 (8°), followed by a reaching movement to the remembered location of T (13°). Gray outline circles: possible stimulus locations. Black circles: exemplar stimuli. (b) In black: reach patterns across all targets after a vergence shift from near to far, averaged across subjects. The gray solid and dotted patterns reflect gaze-centered and gaze-independent predictions, respectively. The best match is found with the gaze-centered model. (c) Updating gains. Across the population of subjects, gains are distributed around a value of 1. Modified from Van Pelt and Medendorp (2008).
(e.g., from FP1 to FP2). Analogously to Henriques et al. (1998), we reasoned that if reach errors were to depend on the distance of the target relative to the fixation point of the eyes, then subjects would make the same errors if their internal representations were remapped to the gaze-centered location during the eye movement, even if they viewed the target from another position. Alternatively, if the brain were to store a gaze-independent depth representation of the target by integrating binocular-disparity and vergence signals at the moment of target presentation, the reach errors would not be affected by the subsequent vergence eye movement. With intervening vergence shifts, our results showed an error pattern that was based on the new eye position and on the depth of the remembered target relative to that position (Figure 9.3b). This suggests that target depth is recomputed after the gaze shift, supporting the gaze-centered model. Interestingly, a recent report by Bhattacharyya et al. (2009) indicates that PRR neurons code depth with respect to the fixation point, which is consistent with these observations. Further quantitative analyses of our results showed that the updating system works with a gain of 1, which suggests that the reach errors arise after the updating stage, i.e., during the subsequent reference frame transformations that are involved in reaching (Figure 9.3c).
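The gaze-centered account can be illustrated with the vergence angles of the example in Figure 9.3 (fixation at 18° and then 8°, target at 13°). The sketch below expresses target depth as the difference between the vergence demanded by the target and the current vergence; the function name and the sign convention are illustrative choices, not taken from the study.

```python
# Vergence angles in deg: larger vergence = nearer to the observer.
# Values follow the example of Figure 9.3: FP1 = 18 deg, FP2 = 8 deg, target T = 13 deg.
fp1_vergence = 18.0
fp2_vergence = 8.0
target_vergence = 13.0

def relative_depth(target, fixation):
    """Gaze-centered depth: vergence demanded by the target minus current vergence.
    Positive = nearer than fixation, negative = farther than fixation."""
    return target - fixation

depth_before = relative_depth(target_vergence, fp1_vergence)   # -5 deg: behind fixation
vergence_change = fp2_vergence - fp1_vergence                  # -10 deg (a divergence)

# Gaze-centered updating: the stored relative depth must change by the size of
# the vergence movement, even though the target itself never moved.
depth_updated = depth_before - vergence_change                 # +5 deg: now in front of fixation

# A gaze-independent (e.g., body-centered) depth code would need no such remapping;
# the pattern of reach errors is what distinguishes the two schemes.
print(depth_before, depth_updated)
```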
9.8 Spatial-constancy computations during body movements
The evidence for gaze-centered spatial updating reviewed above was obtained exclusively with rotational eye movements. Baker et al. (2003) made the same claim for rotational body movements. More specifically, these authors compared saccadic precision after horizontal whole-body rotations, smooth-pursuit eye movements, and saccadic eye movements to memorized targets that remained either fixed in the world or fixed to the gaze. Based on the assumption of noise propagation at various processing stages in the brain, they rejected explicit world- or head-centered representations as an explanation of their results. Instead, they favored a flexible, gaze-centered mechanism in the coding of spatial constancy. Would a gaze-centered mechanism also contribute to the coding and updating of spatial targets during translational motion? In everyday life, translational eye movements occur during a rotation of the head about the neck axis and/or during translational motion of the head and body, for example when we walk around or drive a car. In these situations, a gaze-centered updating mechanism must perform more sophisticated computations than during rotational motion. First, while rotations do not change the distance of an object from the observer, translations do. Second, during translations, visual objects at different depths from the plane of fixation move at different velocities across the retina. This is known as motion parallax, and the same spatial geometry needs to be taken into account in the eye-centered updating of remembered targets during translational motion. Figure 9.4 illustrates this issue in more detail. The left-hand panels show the updating of two remembered visual stimuli T1 and T2 (a) across a leftward eye rotation (b). In this case, the locations of the two targets simply have to be updated by the same (angular) amount to retain spatial constancy (c). In contrast, a translation of the eye requires the same two target locations to be updated by different amounts (right-hand panels). This implies that the internal models in the brain must simulate the translational–depth geometry of motion parallax in order to instantiate spatial updating during translational motion. In recent work, it was shown that human subjects can maintain spatial constancy during self-initiated sideways translations (Medendorp et al., 2003b). Similar results were found during passive whole-body translations of humans (Klier et al., 2008) and nonhuman primates (Li and Angelaki, 2005). However, none of these studies explicitly tested the reference frames involved in implementing this spatial constancy. Toward this goal, we examined spatial updating during active translational motion (Van Pelt and Medendorp, 2007).
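The difference between the two geometries can be made explicit with a small numerical sketch: two targets lying in the same visual direction but at different distances require identical direction updates after an eye rotation, but distance-dependent updates after an eye translation. The positions, step size, and helper function below are hypothetical and serve only to illustrate the parallax geometry.

```python
import numpy as np

def direction_deg(point, eye):
    """Direction of a point relative to the eye (deg), in a 2D top-down view;
    angle is measured from straight ahead (the y axis)."""
    d = point - eye
    return np.degrees(np.arctan2(d[0], d[1]))

# Two remembered targets in the same visual direction but at different distances (cm).
t_near = np.array([10.0, 40.0])
t_far  = np.array([20.0, 80.0])
eye0   = np.array([0.0, 0.0])

# A pure eye rotation leaves the eye in place, so both target directions change
# by the same amount (the inverse of the rotation): the update is uniform.

# A sideways translation of the eye changes the two directions by different amounts.
eye1 = eye0 + np.array([5.0, 0.0])   # a 5 cm step to the right
for label, t in [("near", t_near), ("far", t_far)]:
    shift = direction_deg(t, eye1) - direction_deg(t, eye0)
    print(label, round(shift, 2), "deg")
# The near target's direction shifts roughly twice as much as the far target's:
# the motion-parallax geometry that translational updating must reproduce.
```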
Figure 9.4 Geometrical consequences of rotational (left-hand panels) and translational (right-hand panels) eye movements for gaze-centered spatial updating. The directions of two targets (T1, T2) presented at the same retinal location (a) have to be updated by the same amount during an eye rotation ((b), left-hand panel), rotating the stored locations through the inverse of the eye's rotation in space ((c), left-hand panel). An eye translation ((b), right-hand panel) requires direction updating by different amounts, depending on target distance ((c), right-hand panel). Adapted from Medendorp et al. (2008).
To do this, we briefly presented subjects with visual targets, presented at different distances from the fixation plane (Figure 9.5a, middle panel). After target presentation, subjects were asked to actively translate their body by stepping sideways, and subsequently reach to the remembered locations. We reasoned as follows. Suppose the updating mechanism operates in gaze coordinates, necessitating the simulation of motion parallax with remembered targets; then internal representations of targets initially flashed behind and in front of the fixation point should shift in opposite directions in spatial memory (Figure 9.5a, left panel). If so, these representations will be flawed if the amount of translation is misestimated during the neural simulations, so that the updated internal representations do not correspond to the actual locations in space. As a result, this will cause errors in the reaches performed, with the errors for targets presented in front of the fixation point in a direction opposite to the errors for
Figure 9.5 (a) Translational updating. Gaze-centered representations of locations of targets, flashed in front of and behind the fixation point of the eyes (Tn and Tf, respectively), must be updated in opposite directions when the eyes translate (left). In contrast, gaze-independent representations of these targets (e.g., body-centered representations) require updating in the same direction (right). If the updating is only partial as a result of an erroneous estimation of step size, the internal representations do not encode the actual location, with different error predictions (En and Ef) for the two reference frames. (b) Updating errors for targets flashed at opposite distances from the fixation point favor a gaze-centered updating mechanism, in all subjects. Modified from Van Pelt and Medendorp (2007).
targets initially presented behind the fixation point. In contrast, if the brain does not simulate the parallax geometry, but rather implements spatial constancy across translational motion in a gaze-independent reference frame (e.g., a body-centered frame), misjudging translations will lead to reach errors in the same direction, irrespective of whether the targets were presented in front of or behind the fixation point (Figure 9.5a, right panel). The observed error patterns clearly favor the gaze-centered scheme (Figure 9.5b), which indicates that translational updating is organized along much the same lines as head-fixed saccadic updating. The neural basis of translational updating may, just like rotational eye-movement updating, be found in the parietal cortex or in any other cortical or subcortical structures involved in updating, as long as they have the necessary signals, including ego-velocity signals and stereoscopic depth and direction information, at their disposal. However, neurophysiological evidence is lacking. Finally, although the work reviewed in this section makes a clear case for a role of gaze-centered coordinates in keeping spatial constancy in motor control, it is important not to overplay this claim. We certainly do not want to argue that this frame is used at all times and in all contexts. Other reference frames are also involved in coding spatial representations (see, e.g., Olson and Gettner, 1996; Carrozzo et al., 2002; Hayhoe et al., 2003; Battaglia-Mayer et al., 2003; Pouget et al., 2002). For example, the person in Figure 9.1 can see the desk on which the espresso sits and may be familiar with the spatial layout of the room. Such allocentric cues might be of assistance in inferring the current spatial locations of objects that have disappeared from sight. Also, gravity, another type of allocentric cue, may contribute to the coding and maintenance of a spatial representation. We have shown the importance of gravity-based coordinates, i.e., an allocentric reference frame, in coding targets during whole-body tilts (Van Pelt et al., 2005). In our test, we exploited the fact that subjects, when tilted sideways in the dark, make systematic errors in indicating world-centered directions (Mittelstaedt, 1983; De Vrijer et al., 2008), while the estimation of egocentric directions remains unaffected. We then tested whether the accuracy of a saccade, directed at a memorized location of a visual target briefly presented before a whole-body rotation in roll, was affected by this distortion. We reasoned that if the locations of targets are coded allocentrically, the distortion affects their coding and their readout after the rotation of the body. In the case of an egocentric framework, no systematic storage or readout errors are expected. However, in this case, updating for the rotation of the body is needed, and the errors, if any, should relate to the amount of intervening rotation. The findings clearly favored the allocentric model: the accuracy of the saccade reflected the combination of subjective distortions of the Earth frame
that occurred when the memory was stored (at the initial tilt angle) and when the memory was probed (at the final tilt angle). Thus gravito-inertial cues, if available, could make a contribution to constructing and maintaining spatial representations, although they cannot drive the motor response. In this respect, it seems most logical that the brain interchanges information between allocentric maps and egocentric representations in the organization of spatially guided motor behavior (Crawford et al., 2004).
9.9 Signals in spatial constancy
If the neural system is to maintain a correct representation of target locations in memory, it has to take the effects of all intervening body movements on that representation into account. There are many ways in which the amount of self-movement can be registered and in which these signals can be transformed and combined to guide the updating process. When movements are passive, such as when we ride on a train or drive a car, the amount of motion has to be estimated by our internal sensors. In the presence of light, information about head and body motion in space comes from both the optokinetic system, a visual subsystem for motion detection based on optic flow, and the vestibular system. The vestibular system, located in the inner ear, is composed of otoliths and semicircular canals, which detect the head's linear acceleration and angular velocity, respectively, in all three dimensions (see Angelaki and Cullen (2008) for a review). These systems operate in different frequency domains. The optokinetic system is well equipped to represent sustained movements, whereas the vestibular system is specialized for the detection of transient movements. It is also known that the optokinetic and the vestibular input converge at the level of the vestibular nuclei (Waespe and Henn, 1977). Thus, there is good reason to believe that visual–vestibular sensor fusion is essential for veridical detection of body movements. In the absence of visual cues, when the brain depends strongly on vestibular information to reconstruct motion in space, the limitations of the vestibular system show up. The semicircular canals, which measure angular velocity, exhibit high-pass filter characteristics, with poor responses during low-frequency and constant-velocity movements (Vingerhoets et al., 2007). The otoliths sense gravito-inertial forces and cannot distinguish between tilt and linear accelerations, for elementary physical reasons (Young et al., 1984; Vingerhoets et al., 2007). Moreover, the vestibular signals are noisy (Fernández and Goldberg, 1976) and initially code information in a head-centered frame of reference. The somatosensory system, which includes proprioceptive signals about the relative positions of the body, head, eyes, and limbs, may also detect body movements. So, when our car pulls away, together
these sensors can estimate how much we move, by combining the feeling of acceleration, the optic flow caused by objects in the environment, and the pressure on our back caused by being pushed into the chair. By integrating these sensory signals in an appropriate fashion, a remapping signal can be generated that is used to update target representations and maintain spatial constancy. Although we will not go into detail here, it can easily be understood that these computations are complex owing to differences in the noise properties and internal dynamics of each sensor and in the intrinsic reference frames that each sensor uses (see MacNeilage et al. (2008) for a review of computational approaches). One strategy of the system may be to rely more on one sensor than on another, perhaps depending on sensory context and other constraints. Within this perspective, recent reports have suggested that Bayesian statistics may be a fundamental element of the processing of noisy signals for perception and action (Niemeier et al., 2003; Körding and Wolpert, 2006; MacNeilage et al., 2007; Laurens and Droulez, 2007; De Vrijer et al., 2008). This idea implies that the brain processes the noisy, ambiguous signals in a statistically optimal fashion, implementing the rules of Bayesian inference. Whether the optimal integration of neural signals in multiple reference frames, and perhaps the biases in the transformations between them, would apply to keeping spatial constancy requires further investigation (Vaziri et al., 2006). The relative importance of each sensor in updating for passive movements can be inferred when subjects are deprived of input in a specific sensory modality. This approach has revealed that vestibular signals are crucial in the computation of updating signals. For example, monkeys whose vestibular system was lesioned showed severe deficits in spatial updating during passive body motion (Li and Angelaki, 2005). Importantly, when our movements are not passive but self-initiated, for example when we refixate our gaze or walk, our target representation may have access to information from an additional updating system, parallel to the sensory feedback stream. This feedforward process exploits the fact that we have knowledge about the kind of movement that is coming up. Efference copy signals of the intended eye, head, or body movement generate the updating signals. This process may be similar to or perhaps integrated with the one used in the sensory feedback stream, since it might predict or simulate the sensory information that the movement would generate. Because of the early initiation of this process – when there is only a movement plan but no actual movement yet – the feedforward updater is fast and may facilitate an optimal maintenance of spatial constancy. The neural correlates of the feedforward updating signals include predictive remapping activities observed in various cortical regions, as described above. This way, updating for an active movement will be faster, and perhaps more accurate, than for a similar, passive movement.
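As a hedged illustration of what such statistically optimal processing could look like in its simplest form, the sketch below combines a visual and a vestibular estimate of a body translation by weighting each in proportion to its reliability (inverse variance). This is the standard maximum-likelihood rule from the cue-combination literature, used here only as an illustration; the numbers are invented and the sketch is not a model of the specific studies cited above.

```python
# Illustrative translation estimates (cm) and their noise (standard deviation).
visual_estimate, visual_sd = 12.0, 3.0        # e.g., from optic flow
vestibular_estimate, vestibular_sd = 9.0, 2.0  # e.g., from otolith signals

# Reliability-weighted (inverse-variance) combination.
w_visual = (1 / visual_sd**2) / (1 / visual_sd**2 + 1 / vestibular_sd**2)
w_vestibular = 1 - w_visual

fused_estimate = w_visual * visual_estimate + w_vestibular * vestibular_estimate
fused_variance = 1 / (1 / visual_sd**2 + 1 / vestibular_sd**2)

# The fused estimate leans toward the more reliable (here vestibular) cue and is
# less variable than either cue alone; such a combined self-motion signal could,
# in principle, drive the remapping of stored target locations.
print(round(fused_estimate, 2), round(fused_variance, 2))
```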
9.10 Conclusions
The findings described in this chapter shed light on how the human brain implements spatial constancy in motor control. A central feature of these observations is the dominance of gaze-centered computations, although we have seen that spatial representations do exist in other coordinate systems as well. While gaze-centered coordinates may play a role in maintaining spatial constancy, further reference frame transformations are required to execute a movement to a remembered target. The neural basis of the updating processes may be found in the parietal and/or frontal regions, but it should also be clear that more work is needed to assign the computational functions to specific physiological substrates and to identify the extraretinal signals involved.
Acknowledgments
The authors thank Dr. Luc Selen for valuable comments on an earlier version of the manuscript. This work was supported by grants from the Netherlands Organization for Scientific Research (VIDI: 452-03-307) and the Human Frontier Science Program (CDA).
References
Andersen, R. A. and Buneo, C. A. (2002). Intentional maps in posterior parietal cortex. Annu. Rev. Neurosci., 25: 189–220.
Andersen, R. A., Bracewell, R. M., Barash, S., Gnadt, J. W., and Fogassi, L. (1990). Eye position effects on visual, memory, and saccade-related activity in areas LIP and 7a of macaque. J. Neurosci., 10: 1176–1196.
Angelaki, D. E. and Cullen, K. E. (2008). Vestibular system: the many facets of a multimodal sense. Annu. Rev. Neurosci., 31: 125–150.
Avillac, M., Deneve, S., Olivier, E., Pouget, A., and Duhamel, J. R. (2005). Reference frames for representing visual and tactile locations in parietal cortex. Nature Neurosci., 8: 941–949.
Baker, J. T., Harper, T. M., and Snyder, L. H. (2003). Spatial memory following shifts of gaze. I. Saccades to memorized world-fixed and gaze-fixed targets. J. Neurophysiol., 89: 2564–2576.
Balslev, D. and Miall, R. C. (2008). Eye position representation in human anterior parietal cortex. J. Neurosci., 28: 8968–8972.
Batista, A. P., Buneo, C. A., Snyder, L. H., and Andersen, R. A. (1999). Reach plans in eye centered coordinates. Science, 285: 257–260.
Battaglia-Mayer, A., Caminiti, R., Lacquaniti, F., and Zago, M. (2003). Multiple levels of representation of reaching in the parieto-frontal network. Cereb. Cortex, 13: 1009–1022.
Bellebaum, C. and Daum, I. (2006). Time course of cross-hemispheric spatial updating in the human parietal cortex. Behav. Brain Res., 169: 150–161.
Bhattacharyya, R., Musallam, S., and Andersen, R. A. (2009). Parietal reach region encodes reach depth using retinal disparity and vergence angle signals. J. Neurophysiol., 102: 805–816.
Blohm, G., Missal, M., and Lefèvre, P. (2003). Interaction between smooth anticipation and saccades during ocular orientation in darkness. J. Neurophysiol., 89: 1423–1433.
Bock, O. (1986). Contribution of retinal versus extraretinal signals towards visual localization in goal directed movements. Exp. Brain Res., 64: 467–482.
Burgess, N., Maguire, E. A., and O'Keefe, J. (2002). The human hippocampus and spatial and episodic memory. Neuron, 35: 625–641.
Carrozzo, M., Stratta, F., McIntyre, J., and Lacquaniti, F. (2002). Cognitive allocentric representations of visual space shape pointing errors. Exp. Brain Res., 147: 426–437.
Chang, E. and Ro, T. (2007). Maintenance of visual stability in the human posterior parietal cortex. J. Cogn. Neurosci., 19: 266–274.
Cohen, Y. E. and Andersen, R. A. (2002). A common reference frame for movement plans in the posterior parietal cortex. Nature Rev. Neurosci., 3: 553–562.
Colby, C. L. and Goldberg, M. E. (1999). Space and attention in parietal cortex. Annu. Rev. Neurosci., 22: 319–349.
Crawford, J. D., Medendorp, W. P., and Marotta, J. J. (2004). Spatial transformations for eye hand coordination. J. Neurophysiol., 92: 10–19.
Culham, J. C. and Valyear, K. F. (2006). Human parietal cortex in action. Curr. Opin. Neurobiol., 16: 205–212.
De Vrijer, M., Medendorp, W. P., and Van Gisbergen, J. A. (2008). Shared computational mechanism for tilt compensation accounts for biased verticality percepts in motion and pattern vision. J. Neurophysiol., 99: 915–930.
DeSouza, J. F., Dukelow, S. P., Gati, J. S., Menon, R. S., Andersen, R. A., and Vilis, T. (2000). Eye position signal modulates a human parietal pointing region during memory-guided movements. J. Neurosci., 20: 5835–5840.
Duhamel, J. R., Colby, C. L., and Goldberg, M. E. (1992). The updating of the representation of visual space in parietal cortex by intended eye movements. Science, 255: 90–92.
Fernández, C. and Goldberg, J. M. (1976). Physiology of peripheral neurons innervating otolith organs of the squirrel monkey. I. Response to static tilts and to long-duration centrifugal force. J. Neurophysiol., 39: 970–984.
Golomb, J. D., Chun, M. M., and Mazer, J. A. (2008). The native coordinate system of spatial attention is retinotopic. J. Neurosci., 28: 10654–10662.
Grefkes, C., Weiss, P. H., Zilles, K., and Fink, G. R. (2002). Crossmodal processing of object features in human anterior intraparietal cortex: an fMRI study implies equivalencies between humans and monkeys. Neuron, 35: 173–184.
Guitton, D. (1992). Control of eye–head coordination during orienting gaze shifts. Trends Neurosci., 15: 174–179.
Hallett, P. E. and Lightstone, A. D. (1976). Saccadic eye movements towards stimuli triggered by prior saccades. Vis. Res., 16: 99–106.
Hayhoe, M. M., Shrivastava, A., Mruczek, R., and Pelz, J. B. (2003). Visual memory and motor planning in a natural task. J. Vis., 3: 49–63.
Helmholtz, H. von (1867). Handbuch der Physiologischen Optik. Leipzig: Voss.
Henriques, D. Y., Klier, E. M., Smith, M. A., Lowy, D., and Crawford, J. D. (1998). Gaze-centered remapping of remembered visual space in an open loop pointing task. J. Neurosci., 18: 1583–1594.
Jackson, S. R. and Husain, M. (2006). Visuomotor functions of the posterior parietal cortex. Neuropsychologia, 44: 2589–2593.
Kandel, E. R., Schwartz, J. H., and Jessell, T. M. (eds.) (2000). Principles of Neural Science. New York: McGraw-Hill.
Khan, A. Z., Pisella, L., Rossetti, Y., Vighetto, A., and Crawford, J. D. (2005). Impairment of gaze centered updating of reach targets in bilateral parietal occipital damaged patients. Cereb. Cortex, 15: 1547–1560.
Klier, E. M., Angelaki, D. E., and Hess, B. J. (2005). Roles of gravitational cues and efference copy signals in the rotational updating of memory saccades. J. Neurophysiol., 94: 468–478.
Klier, E. M., Angelaki, D. E., and Hess, B. J. (2007). Human visuospatial updating after noncommutative rotations. J. Neurophysiol., 98: 537–544.
Klier, E. M., Hess, B. J., and Angelaki, D. E. (2008). Human visuospatial updating after passive translations in three dimensional space. J. Neurophysiol., 99: 1799–1809.
Körding, K. P. and Wolpert, D. M. (2006). Bayesian decision theory in sensorimotor control. Trends Cogn. Sci., 10: 319–326.
Krommenhoek, K. P. and Van Gisbergen, J. A. (1994). Evidence for nonretinal feedback in combined version-vergence eye movements. Exp. Brain Res., 102: 95–109.
Laurens, J. and Droulez, J. (2007). Bayesian processing of vestibular information. Biol. Cybern., 96: 389–404.
Li, N. and Angelaki, D. E. (2005). Updating visual space during motion in depth. Neuron, 48: 149–158.
Li, N., Wei, M., and Angelaki, D. E. (2005). Primate memory saccade amplitude after intervened motion depends on target distance. J. Neurophysiol., 94: 722–733.
MacNeilage, P. R., Banks, M. S., Berger, D. R., and Bülthoff, H. H. (2007). A Bayesian model of the disambiguation of gravitoinertial force by visual cues. Exp. Brain Res., 179: 263–290.
MacNeilage, P. R., Ganesan, N., and Angelaki, D. E. (2008). Computational approaches to spatial orientation: from transfer functions to dynamic Bayesian inference. J. Neurophysiol., 100: 2981–2996.
Mays, L. E. and Sparks, D. L. (1980). Saccades are spatially, not retinocentrically, coded. Science, 208: 1163–1165.
McKenzie, A. and Lisberger, S. G. (1986). Properties of signals that determine the amplitude and direction of saccadic eye movements in monkeys. J. Neurophysiol., 56: 196–207.
Medendorp, W. P., Smith, M. A., Tweed, D. B., and Crawford, J. D. (2002). Rotational remapping in human spatial memory during eye and head motion. J. Neurosci., 22: 196RC.
Medendorp, W. P., Goltz, H. C., Vilis, T., and Crawford, J. D. (2003a). Gaze centered updating of visual space in human parietal cortex. J. Neurosci., 23: 6209–6214.
Medendorp, W. P., Tweed, D. B., and Crawford, J. D. (2003b). Motion parallax is computed in the updating of human spatial memory. J. Neurosci., 23: 8135–8142.
Medendorp, W. P., Goltz, H. C., and Vilis, T. (2005). Remapping the remembered target location for anti-saccades in human posterior parietal cortex. J. Neurophysiol., 94: 734–740.
Medendorp, W. P., Beurze, S. M., Van Pelt, S., and Van Der Werf, J. (2008). Behavioral and cortical mechanisms for spatial coding and action planning. Cortex, 44: 587–597.
Melcher, D. and Colby, C. L. (2008). Trans-saccadic perception. Trends Cogn. Sci., 12: 466–473.
Merriam, E. P., Genovese, C. R., and Colby, C. L. (2003). Spatial updating in human parietal cortex. Neuron, 39: 361–373.
Merriam, E. P., Genovese, C. R., and Colby, C. L. (2007). Remapping in human visual cortex. J. Neurophysiol., 97: 1738–1755.
Mittelstaedt, H. (1983). A new solution to the problem of the subjective vertical. Naturwissenschaften, 70: 272–281.
Nakamura, K. and Colby, C. L. (2002). Updating of the visual representation in monkey striate and extrastriate cortex during saccades. Proc. Natl. Acad. Sci. USA, 99: 4026–4031.
Niemeier, M., Crawford, J. D., and Tweed, D. B. (2003). Optimal transsaccadic integration explains distorted spatial perception. Nature, 422: 76–80.
Olson, C. R. and Gettner, S. N. (1996). Brain representation of object-centered space. Curr. Opin. Neurobiol., 6: 165–170.
Pouget, A., Deneve, S., and Duhamel, J. R. (2002). A computational perspective on the neural basis of multisensory spatial representations. Nature Rev. Neurosci., 3: 741–747.
Randall, D., Burggren, W., and French, K. (eds.) (1997). Eckert Animal Physiology: Mechanisms and Adaptations. New York: W. H. Freeman.
Soechting, J. F. and Flanders, M. (1992). Moving in three dimensional space: frames of reference, vectors, and coordinate systems. Annu. Rev. Neurosci., 15: 167–191.
Sommer, M. A. and Wurtz, R. H. (2008). Brain circuits for the internal monitoring of movements. Annu. Rev. Neurosci., 31: 317–338.
Sperry, R. W. (1950). Neural basis of the spontaneous optokinetic response produced by visual inversion. J. Comp. Physiol. Psychol., 43: 482–489.
Umeno, M. M. and Goldberg, M. E. (1997). Spatial processing in the monkey frontal eye field. I. Predictive visual responses. J. Neurophysiol., 78: 1373–1383.
Van Der Werf, J., Jensen, O., Fries, P., and Medendorp, W. P. (2008). Gamma-band activity in human posterior parietal cortex encodes the motor goal during delayed prosaccades and antisaccades. J. Neurosci., 28: 8397–8405.
Representation of 3D action space Van Pelt, S. and Medendorp, W. P. (2007). Gaze-centered updating of remembered visual space during active whole-body translations. J. Neurophysiol., 97: 1209–1220. Van Pelt, S. and Medendorp, W. P. (2008). Updating target distance across eye movements in depth. J. Neurophysiol., 99: 2281–2290. Van Pelt, S., Van Gisbergen, J. A. M., and Medendorp, W. P. (2005). Visuospatial memory computations during whole-body rotations in roll. J. Neurophysiol., 94: 1432–1442. Vaziri, S., Diedrichsen, J., and Shadmehr, R. (2006). Why does the brain predict sensory consequences of oculomotor commands? Optimal integration of the predicted and the actual sensory feedback. J. Neurosci., 26: 4188–4197. Vindras, P., Desmurget, M., and Viviani, P. (2005). Error parsing in visuomotor pointing reveals independent processing of amplitude and direction. J. Neurophysiol., 94: 1212–1224. Vingerhoets, R. A., Van Gisbergen, J. A., and Medendorp, W. P. (2007). Verticality perception during off-vertical axis rotation. J. Neurophysiol., 97: 3256–3268. Vliegen, J., Van Grootel, T. J., and Van Opstal, A. J. (2005). Gaze orienting in dynamic visual double steps. J. Neurophysiol., 94: 4300–4313. Von Holst, E. and Mittelstaedt, H. (1950). The reafferent principle: reciprocal effects between central nervous system and periphery. Naturwissenschaften, 37: 464–476. Waespe, W. and Henn, V. (1977). Neuronal activity in the vestibular nuclei of the alert monkey during vestibular and optokinetic stimulation. Exp. Brain Res., 27: 523–538. Walker, M. F., Fitzgibbon, J., and Goldberg, M. E. (1995). Neurons of the monkey superior colliculus predict the visual result of impeding saccadic eye movements. J. Neurophysiol., 73: 1988–2003. Wang, X., Zhang, M., Cohen, I. S., and Goldberg, M. E. (2007). The proprioceptive representation of eye position in monkey primary somatosensory cortex. Nature Neurosci., 10: 640–646. Young, L. R., Oman, C. M., Watt, D. G., Money, K. E., and Lichtenberg, B. K. (1984). Spatial orientation in weightlessness and readaptation to earth’s gravity. Science, 225: 205–208.
207
10
Binocular motion-in-depth perception: contributions of eye movements and retinal-motion signals
Julie M. Harris and Harold T. Nefs
10.1
Introduction
When an object in the world moves relative to the eye, the image of the object moves across the retina. Motion that occurs on the retina is referred to as retinal motion. When objects move within our visual field we tend to move our eyes, head, and body to track them in order to keep them sharply focused on the fovea, the region of the retina with the highest spatial resolution. When the eyes move to track the object, there is no retinal motion if the tracking is perfect (Figure 10.1), yet we still perceive object motion. Retinal motion is therefore not the only signal required for motion perception. In this chapter, we discuss the problem of how retinal motion and eye movements are integrated for motion perception. After introducing the problem of representing position and motion in three-dimensional space, we will concentrate specifically on the topic of how retinal and eye-movement signals contribute to the perception of motion in depth. To conclude, we discuss what we have learned about how the combination of eye movements and retinal motion differs between the perception of frontoparallel motion and the perception of motion in depth.
10.2
A headcentric framework for motion perception
Position (and motion) in the physical three-dimensional world can be described in a number of different ways. For example, it can be described in
Figure 10.1 (a) If the eyes fixate on a stationary reference while a target object moves, equal retinal motion occurs in each eye, corresponding to the target’s motion in the world. (b) If the eyes move to follow the object, there is no retinal motion, but the eyes move through an angle consistent with the motion of the object. If the brain can monitor how the eyes move, this signal is potentially available for obtaining object motion.
Cartesian coordinates (x, y, z) or in terms of angles and distances with respect to a certain origin. The origin can be chosen quite arbitrarily: it can be associated with the center of the Earth, but also with respect to the center of one’s own body or an eye. Some choices are, however, more convenient than others. For example, Galileo described the motion of the planets in a heliocentric system, where the planets, including the Earth, rotate around the Sun. Although Galileo’s heliocentric description is more compact and more convenient, the description is no more accurate than if the motions of the planets were described in a geocentric framework, relative to the Earth. Furthermore, as stargazers in Scotland, we might be more interested in the trajectories of the planets with respect to Scotland than in their position with respect to the Sun. Because we want to actually look at the planets, we prefer to read the position and nightly trajectories of the planets in a Scotland-centric map of the sky. Alas, Galileo was not able to communicate this subtlety to the Catholic Church; but, then, the Catholic Church was not really known for subtlety in those days. The point we make here is that the choice of the origin of the framework and the choice of the form of the parameters is essentially arbitrary. In visual perception, we are confronted with the problem of what position and motion, as registered on the retina, mean with respect to our heads, our bodies, or the Earth, depending on what task we have to undertake. We start by
Figure 10.2 (a) The position of an object (S) can be described with respect to many different origins, such as the body (B) or the head (H). (b) Definitions of the azimuth (µ), elevation (λ), and azimuthal disparity (γ) in a headcentric framework.
defining our coordinate system and related terms. As illustrated in Figure 10.2a, the position of an object with respect to an origin is a vector S and the position of the origin with respect to the body is the vector B. The position of the object with respect to the body (Sb ) is then the vector sum of S and B (note the directions of the vectors). Likewise, the position of a target with respect to the head (that is, in headcentric coordinates) is the sum Sh = S + B + H.
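For concreteness, the following minimal numerical sketch applies the vector sums just described; the positions (in metres) are arbitrary, hypothetical values chosen only for the example.

import numpy as np

# Hypothetical positions, in metres, all expressed in a common world frame.
S = np.array([0.3, 0.1, 1.5])   # object with respect to the chosen origin
B = np.array([0.0, 0.9, 0.0])   # origin with respect to the body
H = np.array([0.0, 0.4, 0.0])   # offset used to go from the body to the head

S_b = S + B          # object with respect to the body (Sb)
S_h = S + B + H      # object with respect to the head (Sh = S + B + H)
print(S_b, S_h)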
We have illustrated a headcentric framework further in Figure 10.2b. We choose the origin of the framework (shown by the dotted lines) to be directly between the eyes, a position we call the "cyclopean eye." We express an object's position in angular terms by its azimuth (µ), elevation (λ), and azimuthal disparity (γµ), the angle subtended by the lines joining the right eye and the left eye to the object. These three terms specify a unique point in three-dimensional space,¹ and it is straightforward to convert these to Cartesian (x, y, z) coordinates:
x = r sin(µ),
y = −r cos(µ) sin(λ),
z = r cos(µ) cos(λ),
where r = i [cos(µ) ± √(cos²(µ) + tan²(γµ))] / (2 tan(γµ))
(10.1)
and i is the interpupillary distance. In Eq. (10.1), r describes the shortest distance from the object to the cyclopean eye. Note that in Eq. (10.1), some factors vanish when the azimuthal disparity is small and/or when the azimuth is close to 0 (i.e., straight ahead), making the equations much simpler. In contrast to most equations found in the literature, this set of equations is exact and holds for the entire three-dimensional space. So far, we have derived equations to express the Cartesian geocentric location (x, y, z) in terms of headcentric coordinates (µ, λ, γ). From Eq. (10.1), it is also straightforward to calculate the correspondence between motion in Cartesian coordinates (x, y, z) and motion in headcentric coordinates (µ, λ, γ), since motion is simply the derivative of position with respect to time. The equations become horribly long in many cases, but there is still a one-to-one mapping. However, as we said before, the form of the description does not change its correctness, only its convenience of use. Close to the midline, the equations for lateral motion and motion in depth become much simpler, as was, for example, derived by Sumnall and Harris (2002); these equations are given in Eqs. (10.2) and (10.3) below:
Δx = Δ(r sin(µ));   when µ → 0 and λ → 0,
Δx = rΔµ = iΔµ / tan(γ);   using tan(γ) = i/D,
Δx = DΔµ = D(Δµr + Δµl)/2
(10.2)
¹ Mathematically, there is a second solution behind the head, but we accept only the solution that is actually in the visual field, that is, in front of the head.
Δz = Δ(r cos(µ) cos(λ));   when µ → 0 and λ → 0,   using sin(γ) = i/r ≈ i/D,
Δz = Δr = Δ(i/tan(γ)) = iΔγ/sin²(γ);
Δz ≈ D²Δγ/i = D²(Δµl − Δµr)/i
(10.3)
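As an illustration only, the following sketch implements Eq. (10.1) as written above (taking the solution in front of the head) together with the near-midline approximations of Eqs. (10.2) and (10.3); the interpupillary distance of 0.065 m, the viewing distance, and the angular steps are hypothetical values chosen for the example.

import numpy as np

def headcentric_to_cartesian(mu, lam, gamma, i=0.065):
    # Eq. (10.1): azimuth mu, elevation lam, and azimuthal disparity gamma
    # (radians, gamma > 0) converted to Cartesian coordinates, taking the
    # positive root, i.e. the solution in front of the head.
    r = i * (np.cos(mu) + np.sqrt(np.cos(mu)**2 + np.tan(gamma)**2)) / (2.0 * np.tan(gamma))
    x = r * np.sin(mu)
    y = -r * np.cos(mu) * np.sin(lam)
    z = r * np.cos(mu) * np.cos(lam)
    return x, y, z

# Near the midline, Eqs. (10.2) and (10.3) give the approximate Cartesian
# displacements produced by small angular changes at viewing distance D.
i, D = 0.065, 1.0                 # hypothetical interpupillary and viewing distances (m)
d_mu = np.radians(0.1)            # small change in azimuth
d_gamma = np.radians(0.05)        # small change in azimuthal disparity
dx = D * d_mu                     # lateral motion scales linearly with D
dz = D**2 * d_gamma / i           # motion in depth scales with D squared
print(headcentric_to_cartesian(0.0, 0.0, np.arctan(i / D)), dx, dz)

For a target straight ahead, the computed r is close to i/tan(γ) ≈ D, consistent with the simplifications above.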
These derivations show that position and motion in depth can, in principle, be obtained from parameters (azimuth, elevation, and azimuthal disparity) to which the brain has potential access. So far, we have derived equations that link motion in a headcentric frame with motion in a world-based, or geocentric, frame of reference. Our problems are not yet solved! Since the retinas are not directly attached to the head but attached to the head via the eyes, the brain must somehow integrate its estimation of the eye position and eye motion with retinal signals, in order to make an accurate headcentric estimation of real-world motion. The signals in the brain that indicate the changes in the orientations of the eyes (or of the head or body) are usually referred to as extraretinal signals (because they potentially convey information about motion that is separate from the retinal-motion signal). The set of parameters in the headcentric model of Eqs. (10.1)–(10.3) thus doubles. Furthermore, note that other information, for example about the viewing distance D, is also required. Thus, the problem of integrating retinal and extraretinal signals in a full 3D world is not straightforward.
10.3
Are retinal and extraretinal sources of motion information separate?
As with all perceptual systems, the estimations that the system makes are prone to errors and miscalibration, and hence are not as accurate as one would hope. A key point is that the systematic errors in the use of retinal and extraretinal information can be different (and are sometimes assumed to be completely independent). This point can be illustrated by some classic perceptual effects. For example, in the Aubert–Fleischl illusion (Aubert, 1886; Fleischl, 1882), objects are perceived as moving slower when they are pursued with the eyes than when the eyes are held stationary (see, e.g., Festinger et al., 1976). In the former case, the eyes move and the image of the object is stationary (or nearly so) on the retinas; in the latter case, the image moves, whereas the eyes are stationary. The illusion can be explained by assuming an imbalance between independent retinal and extraretinal signals. For example, the brain may underestimate the motion of the eyes relative to equivalent motion on the retina. Extraretinal signals may come from feedback from the eye muscles or from efference copies of the motor program. However, extraretinal and retinal signals
are not completely independent. Rather, information about eye orientation and eye movement can be obtained from retinal signals. For example, Erkelens and van Ee (1998) have suggested that measures of elevation disparity can be used to cross-calibrate the state of the eyes with the state of the retinas. The elevation disparity is the difference between the vertical angles of the object in each eye. If the eyes are correctly aligned, the elevation disparity is zero across the entire field. If not, then specialist mechanisms could register the signal and compensate for the different eye alignments. Then, if a nonzero elevation disparity is still registered, it will mean that the assumed state of the eye or the assumed state of the retinas is incorrect, signaling that recalibration is in order. Erkelens and Van Ee illustrated their suggestions with demonstrations of how a frontoparallel plane appears deformed when the eyes and the retinas are miscalibrated. These retinally available measures of eye position only work properly when the field of view is large and sufficiently filled with visual elements. Nevertheless, this is a way in which retinal signals may lead to a rebalancing of the retinal images and the state of the eyes. The absence of a sufficiently large filled visual field may leave the system in an undefined state. It may influence perception in strange ways since, for example, the absence of elevation disparity does not mean that the elevation disparity is estimated to be zero, nor that elevation disparity has no "weight." The potential of retinal signals to act as a source for the brain to recover information about eye position and eye movement has been known for a long time. For example, the phenomenon of induced motion occurs when a small object is perceived as moving against a larger background when, in fact, the small object is stationary and the background moves. This suggests that large backgrounds tend to function as reference frames for other motion in the visual field (see, e.g., Mack, 1986). When the eyes move, the background scene moves as a whole over the retina. Large areas of retinal motion therefore indicate motion of the eye or head rather than motion of the background in a geocentric framework. Because the extraretinal signals appear to be influenced by retinal signals, Wertheim (1994) preferred to speak of a reference signal rather than an extraretinal signal. Thus the situation is rather complicated. Not only must retinal and extraretinal signals be combined for veridical perception, but also the former can influence the latter, making their relative contributions challenging to study.
10.4
Can we understand 3D motion perception as a simple extension of lateral-motion perception?
It should now be clear to the reader that the perception of motion in three-dimensional space and the interaction between retinal and extraretinal
signals is far more complicated than the special case of lateral-motion perception (motion that does not contain a component in depth). It has often, at least implicitly, been assumed that lateral-motion perception takes place in one- or two-dimensional space rather than three-dimensional space. A popular model for lateral-motion perception is the linear-transducer model. In this model perceived object motion is modeled as a weighted sum of extraretinal and retinal signals. Errors in the use of the signals are usually expressed as gains, where the gain is the ratio of, say, the eye signal used to the retinal signal used. A gain less than 1 means that the eye signal is weighted less than the retinal signal. This model is very elegant in the sense that it describes experimental results rather well and in simple terms (e.g., Freeman and Banks, 1998; Dyde and Harris, 2008). However, it does not reveal the complexity of what the retinal and extraretinal signals are supposed to represent, and there are many good reasons why the research on lateral-motion perception cannot be generalized to motion perception in three-dimensional space. First, lateral motion takes place in three-dimensional space and is not invariant with respect to distance from the observer (e.g., Swanston and Wade, 1988). That is, lateral motion of the same angular size corresponds to very different lateral motion in Cartesian coordinates (in meters and meters per second) depending on whether the object is near or far away. Angular lateral motion and motion in depth also scale differently with increasing distance (Figure 10.3) in Cartesian coordinates. From Eq. (10.2), it can be seen that lateral motion scales linearly with distance, but that motion in depth scales in a quadratic fashion with distance. That is, when the distance increases, the ratio between lateral motion and motion in depth for equivalent angular motion does not remain
Figure 10.3 The same magnitude of motion in the world corresponds to very different retinal motions depending on whether the motion is lateral motion (a), shown here along the x axis, or motion in depth (b) along the z axis.
constant in metric units. In order for the brain to know how lateral motion and motion in depth compare with each other, the brain must take the distance of the object into account. A second reason that complicates the interpretation of 3D motion is that the visual mechanisms processing lateral motion and motion in depth are differentially sensitive. Typical speeds for motion in depth are much smaller in angular units than they are for lateral motion. We know from previous research that the motion discrimination thresholds for lateral motion go up rapidly as soon as the speed drops below about 4°/s (e.g., De Bruyn and Orban, 1988). Furthermore, more retinal motion is required to detect motion in depth than to detect lateral motion (Regan and Beverley, 1973; Tyler, 1971; Sumnall and Harris, 2000). Regan et al. (1986) showed that motion in depth could not be detected for large-field scenes oscillating back and forth in depth over amplitudes of 35° of binocular disparity. The eyes followed the motion of the stimulus, such that the retinal amplitude was much smaller than 35°, but there was still a considerable amount of retinal slip, the residual-retinal motion signal caused by the eyes not exactly following the motion. The retinal slip was easily large enough that it could have been detected by the visual system if it were lateral motion instead of motion in depth. So not only is there less motion on the retina during motion in depth, but the visual system is also less sensitive to it. A third issue to consider is that quite different eye movement mechanisms are thought to operate when lateral motion and motion in depth are followed. To follow lateral motion, the eyes make the same, conjunctive, eye movements, called version. To follow motion in depth, the eyes make opposite, disjunctive, eye movements, called vergence (Figure 10.4). There is evidence that these two sorts of eye movement are controlled by at least partially independent mechanisms in the brain (Collewijn et al., 1995; Semmlow et al., 1998; Maxwell and Schor, 2004). Previous research on lateral-motion perception therefore cannot be generalized to the perception of motion in three-dimensional space, and we hold the view that an understanding of lateral-motion perception cannot be complete without considering motion perception in three-dimensional space. In the sections below, we review the recent work from our own and others' laboratories, which have begun to study how motion perception in three-dimensional space is affected by eye movements.
10.5
How well do our eyes track a target moving in depth?
Before embarking on a discussion of the research, it is important to understand some of the technical issues involved in studying interactions
Figure 10.4 (a) When an observer fixates on an object at a certain depth (gray dot), its image will be foveated in each eye. The images of an object at a different depths (black dot) will be in different relative locations on the left- and right-eye retina. (b) If the eyes fixate a stationary target while another object moves directly away, the motions will be equal and opposite in the two eyes. (c) If the eyes move to follow that target, there will be zero retinal motion of the moving target, but the image of the stationary target will move in the opposite direction.
between retinal and extraretinal information in vision. When the eyes move to follow a target, they do not do so perfectly. The eyes inevitably lag behind the target movement, especially for motions that are fast or are not at a constant velocity. Figure 10.5 shows an example from some of our own data (Nefs and Harris, 2007). Here, the stimulus undergoes sinusoidal motion in depth (180◦ out of phase between the left and right eyes), as shown in the first two columns (top row). The second row shows each eye’s movement, superimposed on the stimulus. Notice that, when the stimulus slows and then changes direction, it takes each eye a little time to catch up. The third column shows the average of the leftand right-eye motions, the version component. The version is close to zero here. The fourth column shows the vergence (left minus right eye). The consequence of the overshoot in each eye’s movement results in an eye vergence that has a slightly larger amplitude than the stimulus itself, and there is a slight phase shift (or time lag) between the stimulus and the eyes. Eye movement analyses typically quantify the difference between the stimulus and the eye motion as a gain (the ratio of eye amplitude to stimulus amplitude) and an associated phase lag. It is important to understand that this is a totally different measure of “gain” from that described above, when we referred to the relative weighting of retinal and extraretinal signals in the transducer model. Here, “gain” refers to what the eyes are measured to do, how they move. The signal gain in the transducer
Figure 10.5 Graphs showing stimulus motion (top row) and measured eye motion, superimposed on a low-pass-filtered trajectory of the eye data (bottom row) for a stimulus moving along a sinusoidal-motion trajectory in depth. The first column shows the left-eye motion, the second column the right eye motion, the third column the average of the two (version), and the fourth column the difference between the two (vergence). Reproduced, with permission, from Nefs and Harris (2007).
model refers to the relative strength between retinal and extraretinal signals. It is implicitly assumed in the transducer model that the “eye gain” is 1. When we refer to gain below, we refer specifically to eye gain. These issues are important when motion in depth is considered. In motion-indepth experiments, motions frequently follow a sinusoidal waveform because the experimenter wishes to remain in the region of parameter space where the stimulus is perceived as fused and single. It is known that the gain is reduced and the phase lag increases as the frequency of the sinusoidal motion in depth increases (Erkelens and Collewijn, 1985). Erkelens and Collewijn also showed that the gain falls as the maximum velocity increases. Thus, when an observer is asked to fixate on a target undergoing sinusoidal motion in depth, they will not be able to track that target perfectly. This means that there will not be a pure extraretinal signal; instead, there will also be a small retinal signal, which will vary from trial to trial. This is an inherent limitation in all studies that compare retinal and extraretinal signals and should be borne in mind when interpreting the results of such studies.
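The following sketch illustrates one way such an eye gain and phase lag could be estimated; it is not the analysis used in the studies discussed here, and the 0.5 Hz stimulus, the 250 Hz sampling rate, and the simulated overshooting, lagging vergence trace are all invented for the example.

import numpy as np

def gain_and_phase(eye, stimulus, t, f):
    # Estimate the eye gain (eye amplitude / stimulus amplitude) and the
    # phase lag of the eye relative to the stimulus at frequency f (Hz),
    # by projecting each demeaned trace onto sine and cosine components.
    def amp_phase(x):
        x = x - np.mean(x)
        a = 2.0 * np.mean(x * np.sin(2 * np.pi * f * t))
        b = 2.0 * np.mean(x * np.cos(2 * np.pi * f * t))
        return np.hypot(a, b), np.arctan2(b, a)
    amp_e, ph_e = amp_phase(eye)
    amp_s, ph_s = amp_phase(stimulus)
    return amp_e / amp_s, ph_s - ph_e

# Hypothetical 4 s trial sampled at 250 Hz: the target demands a 0.5 Hz
# sinusoidal vergence of 1 deg amplitude; the eyes overshoot slightly and lag.
t = np.arange(0.0, 4.0, 0.004)
f = 0.5
stim_vergence = 1.0 * np.sin(2 * np.pi * f * t)
eye_vergence = 1.1 * np.sin(2 * np.pi * f * (t - 0.1))
gain, lag = gain_and_phase(eye_vergence, stim_vergence, t, f)
print(gain, lag)   # roughly 1.1 and 0.31 rad for these made-up traces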
10.6
Are eye movements sufficient for motion-in-depth perception?
We start by considering the extent to which extraretinal information alone can be used to detect motion in depth. If a small target object moves laterally in an otherwise completely dark environment, we make errors in detecting
Figure 10.6 When an object moves towards an observer its retinal image size increases, as shown by the angular sizes of the image for the far (gray lines) and near (black lines) object.
how fast it moves (the Aubert–Fleischl effect), but it still appears to move (e.g., Freeman, 2001). A body of literature from the 1980s suggested that motion is not perceived at all when a target moves in depth, if no other visible reference is present (Erkelens and Collewijn, 1985; Regan et al., 1986). The stimuli used were typically large, consisting of fields of random dots that were moved back and forth in depth. When a real object moves in depth, it undergoes looming: its retinal image gets larger as it approaches and smaller as it recedes (Figure 10.6). The boundary of the stimulus moved in the above studies was not easily visible, and the dots comprising the stimulus did not change in retinal size as they moved back and forth, so it appears that monocular cues, telling the brain that the background is stationary, may be strong enough to overcome the signals from the eye movement. More recent work has shown that when looming and binocular cues are not in conflict, motion in depth can be seen (Brenner et al., 1996). Harris (2006) compared detection of motion in depth for a small target (with a constant screen size of 8.3 arcmin) presented either alone, in an otherwise dark room, or in the presence of a small stationary reference point. The target was small enough to be below the threshold for looming (in other words, if it had loomed, the change in size would not have been detectable (Gray and Regan, 1999)). Figure 10.7 shows data from Harris (2006); observers found it much more difficult to detect the motion when there was no reference, but it
Figure 10.7 Proportion moved towards as a function of distance moved, for three observers, when a small target moved in the presence of a stationary reference point (black symbols and lines), or when the moving target was presented alone in the dark (gray symbols and lines). Data replotted from Harris (2006).
was still possible to do so. If the observers were able to exactly follow the target, the data in Figure 10.7 would suggest that there is an extraretinal signal that can support the perception of motion in depth. Harris did not explicitly measure eye movements, so we cannot know for sure. A more recent study did measure vergence eye movements while observers were asked to track a small target moving in depth in conjunction with a large reference surface (Welchman et al., 2009). At the end of the motion sequence, the small target moved on in isolation (the background disappeared). Here, we were able to determine, in each trial, what the actual retinal and extraretinal motions were, because we could calculate the vergence gain (how well the eyes keep up with the target) in each trial, and the observers’ response (what direction did the target move in after the background had disappeared?). We found that extraretinal information could be sufficient for motion-in-depth perception and that when both extraretinal and retinal signals were available, for a small point stimulus, both seemed to be used. In the next section, we discuss how perceived motion is affected by imperfections in the combination of retinal and extraretinal sources of information about motion in depth.
10.7
How does the combination of retinal and extraretinal information affect the speed of motion in depth?
Very few studies have sought to explore the relative contributions of retinal and extraretinal information to the perception of motion in depth, and how well the visual system compensates for eye movements that occur during motion in depth. We have explored this issue in a pair of recent publications (Nefs and Harris, 2007, 2008). We started by exploring whether an Aubert–Fleischl-like phenomenon could be obtained during the perception of motion in depth. We studied two motionin-depth conditions. The visual stimulus was identical in both conditions, shown in Figure 10.8. A small target object oscillated with a sinusoidal motion, back and forth in depth. Both its retinal size (looming) and its change in binocular disparity were consistent with a real object moving in depth. A second, stationary target was presented above the moving one. In condition 1, observers were asked to fixate on the stationary target (Figure 10.8a) and the moving target moved at one of several different test speeds. In condition 2, observers fixated on the moving target (Figure 10.8b), which moved at a fixed, standard speed. Observers were shown stimuli in two intervals, one containing condition 1 and the other condition 2, and they were asked in which interval the target appeared to move faster. If the observers were able to completely compensate for their eye movements, then there would be no difference between the conditions. The classic Aubert–Fleischl effect predicts that the motion should look slower when
Figure 10.8 Stimulus conditions for the Aubert–Fleischl depth experiment. (a) Observers fixated a stationary reference object while a target object moved in depth. (b) Observers fixated the moving target object (the stationary reference object was still present).
Figure 10.9 (a) Example plot of the probability that motion was seen as faster in condition 1, as a function of the percentage speed difference between the two intervals. The bias is measured as the difference between the speeds corresponding to a proportion of 0.5 correct and to a 0% speed difference. (b) Biases for each of six observers. Data reproduced from Nefs and Harris (2007), with permission.
followed with the eyes (condition 2) than when the eyes fixate on the stationary target (condition 1) (Dichgans et al., 1975). If we plot the probability that the motion was perceived as faster in condition 1 (eyes moving) as a function of the test speed (always shown in condition 2), then there should be a bias. The 50% point (the point where the speeds in the two conditions look identical) should be shifted to a lower speed; the test speed (condition 1) that corresponds to the eyes-moving speed (condition 2) will be smaller than the actual speed presented in condition 2. This is typically what was found. Figure 10.9a shows an example psychometric function, where the 50% point has a negative bias, slightly below the standard speed (0% on the x axis). Figure 10.9b summarizes the results: all observers showed a negative bias, suggesting that motion in depth does look slightly slower when the eyes follow a moving target. Overall, there does appear to be a small Aubert–Fleischl effect, of around 4%.
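To illustrate how such a bias can be read off, the sketch below fits a cumulative Gaussian to made-up response proportions at the speed steps of Figure 10.9a; this is not the authors' analysis, and the use of scipy's curve_fit and the particular data values are assumptions made only for the example. The fitted 50% point is the bias, and a negative value corresponds to the pursued motion appearing slower.

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

# Speed steps (%) as on the x axis of Figure 10.9a; the response proportions
# below are invented for the purpose of this example.
speed_step = np.array([-29, -24, -19, -13, -7, 0, 7, 15, 23, 32, 41], dtype=float)
p_faster = np.array([0.02, 0.05, 0.10, 0.22, 0.38, 0.55, 0.72, 0.84, 0.93, 0.97, 0.99])

def psychometric(x, mu, sigma):
    # Cumulative-Gaussian psychometric function; mu is the 50% point (the bias).
    return norm.cdf(x, loc=mu, scale=sigma)

(mu, sigma), _ = curve_fit(psychometric, speed_step, p_faster, p0=[0.0, 15.0])
print("bias = %.1f%%, spread = %.1f%%" % (mu, sigma))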
10.8
Do eye movements affect the perception of aspects of motion other than speed?
The Filehne illusion (Filehne, 1922) occurs when a stationary target in a scene appears to move when the eyes follow a different target. A related phenomenon is that of induced motion, also known as the Duncker illusion (Duncker, 1929/1938), where a background region of a scene physically moves, but motion is perceived in a stationary target instead (or sometimes as well). This phenomenon has been explored in some detail for lateral motion (for a review, see Mack (1986)) and to some extent for binocular motion in depth (Gogel and Griffin, 1982). Harris and German (2008) have shown that the extent of induced
motion is similar for lateral motion and motion in depth, for similar magnitudes of motion at the retina (this corresponds to considerably more motion in depth than lateral motion in the world, for the conditions they used; see Eq. (10.2)). Very few studies have explored how following eye movements affect the perception of induced motion, for either lateral motion or motion in depth. Two studies that have explored the interaction between eye movements and induced motion in depth were those of Gogel and Griffin (1982) and Likova and Tyler (2003). Neither found big differences between whether observers fixated the target moving in depth or the stationary background, though Gogel and Griffin (1982) did find larger induced motion in depth when the moving target was fixated, but only when the motion was at a low frequency. This could be an important point, because the eyes will not be able to follow the motion perfectly for higher frequencies. It is possible that the larger induced motion occurred for low frequencies because there was a greater eye movement gain, but, as the study did not measure eye movements, this can only be speculation. We have explored whether eye movements affect induced motion for a small stimulus moving in depth, which could have one or the other of binocular (disparity) or monocular (looming) information available, or both (Nefs and Harris, 2008). Observers viewed a pair of targets, an inducer target that oscillated sinusoidally in depth, and a test target that moved with a smaller amplitude and could be in or out of phase with the inducer target. We measured the point of subjective stationarity (the point where the test appeared not to move in depth at all) by using a forced-choice procedure where observers were asked whether the test target and inducer target moved in or out of phase with one another. As well as using three stimulus conditions, we also used two eye conditions. Observers were asked to look either at the inducer target or at the test target (for the latter, the motions were of very small amplitude, so this is as close as the design would allow to an eyes-not-moving condition). Binocular eye movements were recorded throughout. Induced motion was found for all conditions to some extent. In other words, the inducer always had some impact: the test target appeared to be moving slightly, in a direction opposite to the target, when it was physically stationary. First, consider the case when both monocular and binocular cues to motion in depth were present. The left of Figure 10.10a shows a plot, for one observer, of the proportion of occasions when that observer reported the targets to be in phase, as a function of the test amplitude. The light symbols show data for the eyes-on-test condition. Here, notice a small bias in the data: the point of subjective stationarity (0.5 on the y axis, because this performance, by chance,
Figure 10.10 The left column shows plots of the probability of perceiving the inducer and test targets to be in phase, as a function of the phase difference between them. The right column shows the biases for each observer (where bias = amplitude corresponding to 50% in phase). (a) Data from stimuli containing both binocular and looming cues when the eyes were fixated on the target (light symbols) or the inducer (dark symbols). The biases are large only for the eyes-on-inducer condition. (b) There is almost no bias when the motion in depth is specified only by looming. (c) The biases in the binocular-only condition suggest that the effects in (a) are due entirely to the binocular cue. Figure reproduced, with permission, from Nefs and Harris (2008).
reflects the situation where the observer sees no motion at this response proportion), is not at zero phase but is at a slightly positive phase (indicating that the test target appears to undergo motion opposite to that of the inducer target). This is the classic induced motion effect. For the eyes-following condition (dark symbols), the effect is much larger, reflecting much greater induced motion. The biases for all observers are shown in the second column, illustrating that the induced motion is absent or small (averaging around 4%) when the observer views the stimulus that is not moving, and much larger (averaging around 40%) when the eyes follow the motion of the inducer target. Figures 10.10b and c show data when the motion in depth was defined by looming only or by disparity only. There is almost no effect with the size-change cue and a large effect with the disparity-change cue, suggesting that the latter is the source of information involved in generating the percept of induced motion.
10.9
Comparing Aubert–Fleischl and induced motion effects for lateral motion and motion in depth
While the effects on motion perception of moving eyes within a moving head on a moving body have long been studied, very few studies have explored motions (and body movements) that occur when an object moves in depth. In Section 10.6 above, we suggested that the results from studies on lateral motion might well not predict what happens when objects move in depth, owing to the motion ranges used, the possibility that different visual mechanisms are used to perceive retinal motion in depth (Harris et al., 1998), and the different eye movement mechanisms used for vergence compared with version. Below, we consider whether the effects are in any way comparable. This topic is of interest because by comparing lateral motion and motion in depth we can understand more about the fundamental mechanisms that control each type of motion. Comparisons between induced lateral motion and induced motion in depth have been rare. Harris and German (2008) measured induced motion for lateral motion and motion in depth of small point targets, where the magnitude of motion on the retinas was equal. They found no difference between the amounts of motion induced. They did not ask observers to fixate on any particular location, and did not measure eye movements. Nefs and Harris (2008) explored induced motion in depth for larger stimuli, which also loomed as they approached, and found big differences in performance when the eyes fixated the moving inducer or the near-stationary target. We also set up an induced-lateral-motion experiment, using the same magnitude of motions on the retina as was used for motion in depth. Figure 10.11 shows data from this experiment, with the motion-in-depth data in Figure 10.11a and the lateral-motion data in Figure 10.11b. Notice the differences between the two rows. For lateral motion, there is still a larger induced motion effect when the eyes follow the moving target than when they fixate the test target, but the induced motion is, on average, slightly bigger for each of these conditions than for the equivalent motion in depth. A possible explanation for the larger induced motion found with lateral motion is that the additional visual information provided by the looming cue could act to suppress the induced motion. This suppression would be maximal when the observer fixated the target (hence little induced motion) and smaller when the inducer was fixated (because the target is now viewed in peripheral vision), and would be absent for lateral motion (where there is no looming for lateral motion). Of course, there could be other reasons, linked to the differential mechanisms that may be present for lateral motion and motion in depth (see, e.g.,
Figure 10.11 The data are in the same format as in Figure 10.10. The top row shows the induced motion for motion in depth, and the bottom shows the induced motion for lateral motion. Figure reproduced, with permission, from Nefs and Harris (2008).
Harris et al., 2008), or to the different mechanisms responsible for version and vergence eye movements. None of the current data are able to rule these out, but the explanation above does account for the differences found in our own studies (Harris and German, 2008; Nefs and Harris, 2008).
10.10
Summary and conclusions
In this chapter, we have sought to explain how both retinal and extraretinal information are required to perceive objects moving in the world, and that when objects move in depth, the issues are subtly different from when objects move in the frontoparallel plane. We have discussed the current literature in this area, and attempted to compare effects for both lateral motion and motion in depth. It seems that similar effects are found for the two kinds of motion, but they are not identical. There are many other issues to consider in this area, not least the link between what the eyes are doing on each trial, and whether it is possible for us to explain human performance on a trial-by-trial basis from the measured eye movements. These and related issues are beyond the scope of this chapter, but more detailed work has begun to reveal some of the answers (see Nefs and Harris, 2007, 2008; Welchman et al., 2009). The research described is some of the very first work to consider these issues for anything other than lateral motion. This work makes a start in our
understanding, but we still do not know whether differences arise because the brain mechanisms responsible for processing retinal, extraretinal, and the combination of both sorts of information are different.
References
Aubert, H. (1886). Die Bewegungsempfindung. Pflügers Archi., 39: 347–370.
Brenner, E., Van den Berg, A. V., and Van Damme, W. J. (1996). Perceived motion in depth. Vis. Res., 36: 299–706.
Collewijn, H., Erkelens, C. J., and Steinman, R. M. (1995). Voluntary binocular gaze-shifts in the plane of regard: dynamics of version and vergence. Vis. Res., 35(23/24): 3335–3358.
De Bruyn, B. and Orban, G. A. (1988). Human velocity and direction discrimination measured with random dot patterns. Vis. Res., 28: 1323–1335.
Dichgans, J., Wist, E., Diener, H. C., and Brandt, T. (1975). The Aubert–Fleischl phenomenon: a temporal frequency effect on perceived velocity in afferent motion perception. Exp. Brain Res., 23: 529–533.
Duncker, K. (1929/1938). Über induzierte Bewegung. Psychol. Forsch. Translated and condensed as "Induced motion." In W. D. Ellis (ed.), A Source Book on Gestalt Psychology, Vol. 12, pp. 180–259. London: Kegan, Paul, Trench and Grubner.
Dyde, R. T. and Harris, L. R. (2008). The influence of retinal and extra-retinal motion cues on perceived object motion during self-motion. J. Vis., 8(14): 5.
Erkelens, C. J. and Collewijn, H. (1985). Eye movements and stereopsis during dichoptic viewing of moving random-dot stereograms. Vis. Res., 25: 1689–1700.
Erkelens, C. J. and van Ee, R. (1998). A computational model of depth perception based on headcentric disparity. Vis. Res., 39: 2999–3018.
Festinger, L., Sedgwick, H. A., and Holtzman, J. D. (1976). Visual perception during smooth pursuit eye movements. Vis. Res., 16: 1377–1386.
Filehne, W. (1922). Über das optische Wahrnehmen von Bewegungen. Z. Sinnesphysiol., 53: 134–144.
Fleischl, E. V. (1882). Physiologisch-optische Notizen, 2. Mitteilung. Sitz. Wien. Bereich Akad. Wiss., 3: 7–25.
Freeman, T. C. A. (2001). Transducer models of head-centric motion perception. Vis. Res., 41: 2741–2755.
Freeman, T. C. A. and Banks, M. S. (1998). Perceived headcentric speed is affected by both extra-retinal and retinal errors. Vis. Res., 38: 941–945.
Gogel, W. C. and Griffin, B. W. (1982). Spatial induction of illusory motion. Perception, 11: 187–199.
Gray, R. and Regan, D. (1999). Adapting to expansion increases perceived time-to-collision. Vis. Res., 39: 3602–3607.
Harris, J. M. (2006). The interaction of eye movements and retinal signals during the perception of 3D motion direction. J. Vis., 6(8): 444–790.
Harris, J. M. and German, K. J. (2008). Comparing motion induction in lateral motion and motion in depth. Vis. Res., 48: 695–702.
Harris, J. M., McKee, S. P., and Watamaniuk, S. N. (1998). Visual search for motion-in-depth: stereomotion does not "pop out" from disparity noise. Nature Neurosci., 1: 165–168.
Harris, J. M., Nefs, H. T., and Grafton, C. E. (2008). Binocular vision and motion in depth. Spa. Vis., 21: 531–547.
Likova, L. T. and Tyler, C. W. (2003). Spatiotemporal relationships in a dynamic scene: stereomotion induction and suppression. J. Vis., 3(5): 304–317.
Mack, A. (1986). Perceptual aspects of motion in the fronto-parallel plane. In K. R. Boff, L. Kaufman, and J. P. Thomas (eds.), Handbook of Perception and Human Performance: I. Sensory Processes and Perception, chapter 17, pp. 1–38. New York: Wiley.
Maxwell, J. S. and Schor, C. M. (2004). Symmetrical horizontal vergence contributes to the asymmetrical pursuit of targets in depth. Vis. Res., 44: 3015–3024.
Nefs, H. T. and Harris, J. M. (2007). Vergence effects on the perception of motion-in-depth. Exp. Brain Res., 183: 313–322.
Nefs, H. T. and Harris, J. M. (2008). Induced motion-in-depth and the effects of vergence eye movements. J. Vis., 8: 1–16.
Regan, D. and Beverley, K. I. (1973). Some dynamic features of depth perception. Vis. Res., 13: 2369–2378.
Regan, D., Erkelens, C. J., and Collewijn, H. (1986). Necessary conditions for the perception of motion in depth. Invest. Ophthalmol. Vis. Sci., 27: 584–597.
Semmlow, J. L., Yuan, W., and Alvarez, T. L. (1998). Evidence for separate control of slow version and vergence eye movements: support for Hering's law. Vis. Res., 38(8): 1145–1152.
Sumnall, J. H. and Harris, J. M. (2000). Binocular 3D motion: contributions of lateral and stereomotion. J. Opt. Soc. Am. A, 17: 687–696.
Sumnall, J. H. and Harris, J. M. (2002). Minimum displacement thresholds for binocular three-dimensional motion. Vis. Res., 42: 715–724.
Swanston, M. T. and Wade, N. J. (1988). The perception of visual motion during movements of the eyes and of the head. Percept. Psychophys., 43: 559–566.
Tyler, C. W. (1971). Stereoscopic depth movement: two eyes less sensitive than one. Science, 174: 958–961.
Welchman, A. E., Harris, J. M., and Brenner, E. (2009). Extra-retinal signals support the estimation of 3D motion. Vis. Res., 49: 782–789.
Wertheim, A. H. (1994). Motion perception during self-motion: the direct versus inferential controversy revisited. Behav. Brain Sci., 17: 293–355.
11
A surprising problem in navigation
Yogesh Girdhar and Gregory Dudek
11.1
Introduction
Navigation tasks, and particularly robot navigation, are tasks that are closely associated with data collection. Even a tourist on holiday devotes extensive effort to reportage: the collection of images, narratives or recollections that provide a synopsis of the journey. Several years ago, the term vacation snapshot problem was coined to refer to the challenge of generating a sampling and navigation strategy (Bourque and Dudek, 2000). The notion of a navigation summary refers to a class of solutions to this problem that capture the diversity of sensor readings, and in particular images, experienced during an excursion without allowing for active alteration to the path being followed. An ideal navigation summary consists of a small set of images which are characteristic of the visual appearance of a robot's trajectory and capture the essence of what was observed. These images represent not only the mean appearance of the trajectory but also its surprises. In the context of this chapter, we define a navigation summary to be a set of images (Figure 11.1) which minimizes surprise in the observation of the world. In Section 11.3, we present an information-theory-based formulation of surprise, suitable for the purpose of generating summaries. The decisions of selecting summary images can be made either offline or online. In this chapter we will discuss both versions of the problem, and present experimental results which highlight differences between the corresponding methods. In Section 11.4, we present two different offline strategies (Figure 11.2) for picking the summary images. First, we present a simple technique which picks
Figure 11.1 An illustrated example of a navigation summary. The sequence of images represents the observations made by a robot as it traverses a terrain. The boxes indicate one possible choice of the summary images. These images capture both the surprise and the mean appearance of the terrain. A color version of this figure can be found on the publisher’s website (www.cambridge.org/ 9781107001756).
Figure 11.2 Navigation summaries can be generated either offline or online. Offline, we can focus either on picking images which are the most surprising or on picking images which describe the maximum number of observations (maximum cover). Online, we can either make our image choices irrevocable or allow them to be replaced by a more suitable image from a future observation.
the most surprising images greedily. This is similar to picking the images at the corners of the high-dimensional manifold formed by the images in the information space. Another strategy is to model the problem as an instance of the
classical set cover problem, which results in favoring images which describe a maximum number of observations. In Section 11.5, we present two different online strategies for picking the summary images: picking with replacement and picking irrevocably. The online version of the summarization problem is especially interesting in the context of robotics owing to the many resulting applications. Picking images with replacement could, for example, allow a surveillance robot to continuously monitor the world looking for surprises. When a surprising image is found, it could be brought to the attention of a human and then added to the summary set to update the current model of the appearance of the world. We discuss strategies to do this in Section 11.5.1. Adding the additional constraint of making the image choices irrevocable allows us to trigger physical actions which are limited by resource constraints. For example, consider the case of a planetary rover with a mission to drill and examine samples from different kinds of rocks. We can then link the event of selecting a summary image to selecting the drilling locations. We might only be allowed to drill a finite number of rocks, which would then correspond to picking only a finite number of images irrevocably. In Section 11.5.2, we discuss strategies to pick images irrevocably such that the probability that they are the most surprising images is maximized. Although navigation summaries are sometimes based on the use of GPS data as well as pure video information, this chapter is concerned with the acquisition and use of video data alone for the online summarization process. In our previous work (Girdhar and Dudek, 2009, 2010a, 2010b), we have looked at several different instances of the problem of generating navigation summaries. The goal of this chapter is to present a comprehensive picture of the summary problem.
11.2
Related work
The problem of identifying summary images is related to the problem of identifying landmark views in a view-based mapping system. A good example is the work by Konolige et al. (2009). In this work, the goal is to identify a set of representative views and the spatial constraints among these views. These views are then used to localize a robot. With this approach, we end up with the number of images being proportional to the length of the robot trajectory, and hence these view images do not satisfy our size criterion. There is related work by Ranganathan and Dellaert (2009), where the goal is to identify a set of landmark locations, and then build a topological map using them. The images selected by the system, however, although small in number
and well suited to building topological maps, are still not suitable for generating online navigation summaries. First, the algorithm used is an offline algorithm which requires building a vocabulary of visual words (Sivic and Zisserman, 2006) by clustering scale-invariant feature transform (SIFT) (Lowe, 2004) features extracted from all the observed images. Second, we are interested in selecting not only surprising landmark locations but also images which represent the typical (i.e., mean) appearance of the world. Chang et al. (1999) have used set cover methods to build a video-indexing and retrieval scheme. We use similar ideas to define our summary images. There is a body of literature on the related problem of offline video summarization, where we have random access to all the observed images. For example, Gong and Liu (2000) produced video summaries by exploiting a principal-component representation of the color space of the video frames. These authors used a set of local color histograms and computed a singular-value decomposition of these local histograms to capture the primary statistical properties (with respect to a linear model) of how the color distribution varied. This allowed them to detect frames whose color content deviated substantially from the typical frame as described by this model. Dudek and Lobos (2009) used a similar principal-components analysis technique but also included the coordinates of the images to produce navigation summaries. Ngo et al. (2005) first modeled a video as a complete undirected graph and then used the normalized graph cut algorithm to partition the video into different clusters.
11.3
Surprise
11.3.1
Bayesian surprise
Itti and Baldi (2009) formally defined the Bayesian surprise in terms of the difference between the posterior and prior beliefs about the world. They showed that observations which lead to a high Kullback–Leibler (KL) divergence (Kullback, 1959) between the posterior and prior visual-appearance hypotheses are very likely to attract human attention. The relative entropy, or KL divergence, between two probability mass functions p(x) and q(x) is defined as
dKL(p ‖ q) = Σ_{x∈X} p(x) log [p(x)/q(x)]
(11.1)
The KL divergence can be interpreted as the inefficiency in coding a random variable from the distribution p when its distribution is assumed to be q.
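A minimal sketch of this computation is given below; the feature histograms and the small smoothing constant (used to keep the logarithm finite) are assumptions added for illustration, not part of the formulation above.

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # d_KL(p || q) for two histograms defined over the same feature bins.
    # A small epsilon keeps the logarithm finite when a bin count is zero.
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical feature histograms (e.g. colour bins or visual-word counts):
prior = [10, 40, 30, 20]        # P(F | S): appearance predicted by the summary
posterior = [12, 38, 10, 40]    # P(F | Z, S): belief after observing image Z
print(kl_divergence(posterior, prior))   # the Bayesian surprise of Eq. (11.5)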
In this chapter, we represent the surprise by the symbol ξ:
ξ = dKL(posterior ‖ prior).
(11.2)
Suppose we have a set of summary images S = {Si }, which visually summarizes all our observations so far. Let F be a random variable representing the presence of some visual feature. For example, F could represent the presence of a given color or a visual word (Sivic and Zisserman, 2003, 2006). Let π − be the prior probability distribution over all such features: π − = P(F|S).
(11.3)
Similarly, we can define the posterior probability distribution π + after a new image Z has been observed: π + = P(F|Z, S).
(11.4)
Using Itti and Baldi’s definition of the surprise, we can then define the surprise ξ in observing an image Z, given a summary S, as
ξ(Z|S) = dKL(π+ ‖ π−).
(11.5)
The surprise ξ(Z|S) can be interpreted as the amount of information gained in observing Z. Ideally, we would like to choose a summary set such that information gained after observing any random image from the terrain is small. In such a case, this would imply that our summary images already contain most of the information about the world.
11.3.2
Set-theoretic surprise
Instead of modeling the appearance of the world with a single distribution of visual features, we propose to maintain a set of local hypotheses, each corresponding to an image in the summary set. This set of distributions can then be interpreted as the prior hypothesis for the appearance of the world. Using a set of distributions to model the appearance allows us to model arbitrary complexity in the visual appearance of the world by simply increasing or decreasing the size of the summary set. A definition of surprise using this set of local hypotheses can be computed in the following way (Girdhar and Dudek, 2010a). We define the prior hypotheses as a set of local hypotheses, each modeled by a distribution describing the probability of seeing a visual feature in the local region represented by a summary image:
Π− = {P(F|S1), . . . , P(F|Sk)}.
(11.6)
Similarly, we define the posterior hypothesis using the union of the prior hypotheses and the observation:

    \Pi^+ = \{P(F \mid S_1), \ldots, P(F \mid S_k), P(F \mid Z)\}.    (11.7)
Now, analogously to the Bayesian surprise, we would like to measure the distance between these two sets of distributions. The Hausdorff metric provides a natural way to compute the distance between two such sets. For two sets A, B, the Hausdorff distance between the sets is defined as
    d_H(A, B) = \max\left\{ \sup_{a \in A} \inf_{b \in B} d(a, b),\; \sup_{b \in B} \inf_{a \in A} d(a, b) \right\}.    (11.8)
We define the set-theoretic surprise ξ* as the Hausdorff distance between the sets of posterior and prior distributions, with the KL divergence as the distance metric:

    \xi^*(Z \mid S) \stackrel{\text{def}}{=} d_{H,KL}(\Pi^+, \Pi^-).    (11.9)
However, since Π⁻ ⊆ Π⁺, and the two sets differ only by one element, we obtain the following by expanding Eq. (11.9):

    \xi^*(Z \mid S) = \max\left\{ \sup_{\pi^+ \in \Pi^+} \inf_{\pi^- \in \Pi^-} d_{KL}(\pi^+ \,\|\, \pi^-),\; 0 \right\}    (11.10)
                    = \sup_{\pi^+ \in \Pi^+} \inf_{\pi^- \in \Pi^-} d_{KL}(\pi^+ \,\|\, \pi^-)    (11.11)
                    = \inf_{\pi^- \in \Pi^-} d_{KL}(P(F \mid Z) \,\|\, \pi^-).    (11.12)
This is visualized graphically in Figure 11.3.
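Equation (11.12) makes the set-theoretic surprise straightforward to compute in practice: it is simply the KL divergence from the observation's feature distribution to the closest distribution in the summary set. A sketch, reusing the kl_divergence helper from the sketch above:

    def set_theoretic_surprise(z_hist, summary_hists):
        """Set-theoretic surprise xi*(Z|S), following Eq. (11.12).

        z_hist: feature histogram of the new observation Z.
        summary_hists: list of feature histograms, one per summary image S_i.
        Returns the KL divergence from P(F|Z) to the nearest prior hypothesis.
        """
        return min(kl_divergence(z_hist, s_hist) for s_hist in summary_hists)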
11.3.3 Image representation
We can extract different kinds of features from an image to come up with feature distributions that we can use to compute the surprise. Here are some examples of features that can be used:

• Color histograms. This is probably the simplest of all image representations. A normalized color histogram can be used to represent the probability that the world has a given color. Of course, its lack of invariance with respect to illumination conditions makes it unsuitable for many scenarios.

• Gabor response histograms. A 2D Gabor filter is characterized by a preferred spatial frequency and an orientation. It is essentially the product of a Gaussian and a sine function of a given frequency oriented along a given direction. An example is shown in Figure 11.4. Gabor filters have been used extensively for texture classification (Daugman, 1985; Turner, 1986; Jain and Farrokhnia, 1991; Grigorescu et al., 2002).

• Bag of words. Sivic and Zisserman (2006) have proposed a "bag-of-words model," in which each image is described as a histogram of word counts. The "words" used in the histogram are obtained by clustering SIFT (Lowe, 2004) features. We extract SIFT or SURF (Bay et al., 2008) features from both the observation and the summary image being compared and cluster them using the k-means clustering algorithm to generate a vocabulary of 64 words, unique to that pair of images. The normalized frequency counts of these SIFT or SURF words in the observation and the summary are then assumed to be their respective descriptions (a code sketch follows the figure captions below).

Figure 11.3 Set-theoretic surprise. We model our prior using a set containing the summary images {Si} and the posterior using a set containing summary images and the observed image Z. Each image is represented as a bag-of-words histogram, normalized to form a probability distribution. The surprise is then defined as the Hausdorff distance between these sets of probability distributions. We use the KL divergence as the distance metric. Since the two sets differ only by one element, the Hausdorff distance can be simplified to finding only the closest element in the summary set. A color version of this figure can be found on the publisher's website (www.cambridge.org/9781107001756).

Figure 11.4 Sine Gabor filters with orientations of 0°, 30°, . . . , 180° and wavelengths of 4, 8, 16, and 32 pixels.
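As a hedged illustration of the pairwise bag-of-words description above (not the authors' original implementation), the following sketch assumes OpenCV for SIFT extraction and scikit-learn for k-means; the function name and the grayscale conversion step are our own choices:

    import cv2
    import numpy as np
    from sklearn.cluster import KMeans

    def pairwise_bow_histograms(img_a, img_b, vocab_size=64):
        """Describe two images with a vocabulary built from their own features.

        Mirrors the per-pair vocabulary described in the text: SIFT features
        from both images are pooled, clustered into `vocab_size` words, and
        each image is summarized by its normalized word-frequency histogram.
        """
        sift = cv2.SIFT_create()
        descs = []
        for img in (img_a, img_b):
            gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
            _, d = sift.detectAndCompute(gray, None)
            descs.append(d)
        vocab = KMeans(n_clusters=vocab_size, n_init=10).fit(np.vstack(descs))
        hists = []
        for d in descs:
            words = vocab.predict(d)
            h = np.bincount(words, minlength=vocab_size).astype(float)
            hists.append(h / h.sum())
        return hists

The two returned histograms can then be compared directly with the kl_divergence helper given earlier.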
11.4 Offline navigation summaries

In this section, we present two different ideas for selecting the summary images offline.

11.4.1 Extremum summaries
Algorithm 11.1 presents a simple greedy approach to the summary problem. Here k is the number of desired images in the summary set, Z is the set of input images, and S is the set of summary images. This algorithm has O(|Z|²) computational complexity. It essentially picks images at the corners of the high-dimensional manifold formed by the images.

    S ← ∅
    repeat
        Zmax ← argmax_{Zi ∈ Z} ξ(Zi | S)
        S ← S ∪ {Zmax}
        Z ← Z \ {Zmax}
    until |S| ≥ k
    return S

Algorithm 11.1 ExtremumSummary(Z|k). Computes a summary as a subset of the set of input images Z by greedily picking the images with maximum surprise.

Picking extremum images implies that we are focusing on finding only the surprises in the world. However, if our goal is to find images which are also representative of the mean appearance of the world, we must come up with an alternative strategy.
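A direct Python transcription of Algorithm 11.1 might look as follows (a sketch; surprise stands for whichever measure ξ from Section 11.3 is in use, and the seeding of the summary is simplified relative to the experiments reported later):

    def extremum_summary(images, k, surprise):
        """Greedy extremum summary (Algorithm 11.1, sketch).

        images: list of image descriptors (e.g., feature histograms).
        k: desired summary size.
        surprise: callable surprise(z, summary_list) -> float.
        """
        remaining = list(range(len(images)))
        summary = []                    # indices of chosen images
        while len(summary) < k and remaining:
            if not summary:
                best = remaining[0]     # simple seed; the offline experiments
                                        # instead seed with the least
                                        # surprising image (omitted here)
            else:
                chosen = [images[i] for i in summary]
                best = max(remaining, key=lambda i: surprise(images[i], chosen))
            summary.append(best)
            remaining.remove(best)      # removing an integer index is unambiguous
        return [images[i] for i in summary]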
11.4.2 Summaries using set cover methods
The task of selecting summary images can also be modeled as an instance of the classical set cover problem. Let Z = {Zi} be our set of all observed images. Each of these observed images might contain visual information about not just its own cells, but also other cells with similar visual characteristics. We would now like to pick images to form our summary set S so that we are not surprised when we observe any image from the terrain. We say that an image Zj is covered by an image Zi if observing Zj is not surprising given that we have already observed Zi. Assuming we have some way of measuring the surprise ξ, we can then define the cover of an image Z as

    C(Z \mid \xi_T) = \{Z_j : \xi(Z_j \mid Z) < \xi_T\},    (11.13)
where ξ(Zj|Zi) measures the surprise in observing Zj given image Zi, and ξT is the threshold surprise for including an image in the cover. Similarly, we can define the cover of a set of images S as

    C(S \mid \xi_T) = \bigcup_{S_i \in S} C(S_i \mid \xi_T).    (11.14)
Our goal is now to find the minimal set of images S which covers the entire terrain. This is essentially an instance of the classical set cover problem, with |Z| elements in the universe and |Z| sets which span the universe. Figure 11.5 shows an example of an instance of the set cover problem.
Figure 11.5 An instance of the set cover problem. The set {Z4 , Z5 , Z6 , Z7 } is the smallest set of sets which cover all the elements in the universe. A color version of this figure can be found on the publisher’s website (www.cambridge.org/ 9781107001756).
The set cover problem is known to be NP-hard (Karp, 1972). Hence we use a greedy strategy to pick our images. Algorithm 11.2 greedily picks an image Zmax which provides the maximum additional cover, and then adds it to the summary set S. We stop when the coverage ratio exceeds the parameter γ. Setting a value of γ = 1 implies that we stop when the summary covers all the images in the summary set. Lower values of γ can be used for higher noise tolerance, but at the cost of an increased likelihood of missing a surprising image. Another parameter for this algorithm is the surprise threshold ξT used for computing the cover of an image. The number of images in the summary obtained using this greedy strategy is guaranteed to be no more than OPT log |Z|, where OPT is the number of sets in the optimal summary set (Chvatal, 1979).

    S ← ∅
    repeat
        Zmax ← argmax_{Zi ∈ Z} |C({Zi} ∪ S | ξT) \ C(S | ξT)|
        S ← S ∪ {Zmax}
        Z ← Z \ {Zmax}
    until |C(S | ξT)| / |Z| ≥ γ
    return S

Algorithm 11.2 γ-CoverSummary(Z|ξT). Computes a summary as a subset of the set of input images Z, given the surprise threshold ξT, by greedily picking the images with maximum cover. We stop when the coverage ratio is more than γ.

Often, we do not have a way of estimating a good value for the surprise threshold ξT. Instead, we can fix the number k of images we want in our summary and then find the smallest value of ξT that gives us the desired coverage ratio. Using Algorithm 11.3, we can hence define ξT as
    \xi_T = \min\left\{ \xi_T : \frac{|\text{k-CoverSummary}(Z \mid k, \xi_T)|}{|Z|} \geq \gamma \right\}.    (11.15)
Algorithm 11.3 greedily picks k summary images which provide maximum combined cover for the summary set. This problem of finding the optimal set of samples with the maximum cover is known as the max k-cover problem. The max k-cover problem, like the set cover problem, is also known to be NP-hard, and the greedy approach to this problem approximates the optimal solution to within a ratio of 1 − 1/e (Slavík, 1997).
    S ← ∅
    repeat
        Zmax ← argmax_{Zi ∈ Z} |C({Zi} ∪ S | ξT) \ C(S | ξT)|
        S ← S ∪ {Zmax}
        Z ← Z \ {Zmax}
        k ← k − 1
    until k = 0
    return S

Algorithm 11.3 k-CoverSummary(Z|k, ξT). Computes a summary with k images, given the surprise threshold ξT, by greedily picking the images with maximum cover.
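A sketch of the greedy selection in Algorithm 11.3, with covers represented as sets of image indices and surprise again standing for the chosen pairwise measure (an illustration, not the authors' code):

    def k_cover_summary(images, k, xi_t, surprise):
        """Greedy max k-cover summary (Algorithm 11.3, sketch).

        images: list of image descriptors.
        k: number of summary images to select.
        xi_t: surprise threshold used to decide whether one image covers another.
        surprise: callable surprise(z, [s]) -> float for a single candidate s.
        """
        n = len(images)
        # cover[i] = set of image indices that image i covers (Eq. (11.13)).
        cover = [{j for j in range(n)
                  if surprise(images[j], [images[i]]) < xi_t}
                 for i in range(n)]
        covered = set()
        summary = []
        remaining = set(range(n))
        for _ in range(k):
            if not remaining:
                break
            # Pick the image adding the largest number of newly covered images.
            best = max(remaining, key=lambda i: len(cover[i] - covered))
            summary.append(best)
            remaining.discard(best)
            covered |= cover[best]
        return [images[i] for i in summary]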
11.5 Online navigation summaries
A navigation summary could be considered as a model of the visual appearance of the world. A surprising observation can then be defined as being one which is not well explained by the model. In an online scenario, we can compute the surprise of an incoming observation given the current summary and, if it has a surprise value that is above some threshold surprise, we can then choose to include it in the current summary. However, since we require the summary set to be of finite size, we must either allow a new choice to replace a previous pick or, if we pick irrevocably, then we must pick only k images.

11.5.1 Wobegon summaries: picking with replacement
Broder et al. (2008) named the general strategy of picking samples above the mean or median score as "Lake Wobegon strategies."1 If the number of observed images is unknown or possibly infinite, then picking images above the mean surprise of the previously selected images is a good strategy. We alter the basic strategy in two ways, as shown in Algorithm 11.4. First, we consider for selection only those images which locally have maximum surprise. Second, when we compute the threshold surprise ξT, we recompute the surprise of each selected image relative to other images in the summary. This allows occasional lowering of the threshold if we find a good summary image which makes other images in the summary redundant.

1 Named after the fictional town "Lake Wobegon," where, according to Wikipedia, "all the women are strong, the men are good looking, and all the children are above average" (Broder et al., 2008).
    S ← {Z1}
    t ← 1
    ξT ← 0
    repeat
        t ← t + 1
        if ξ(Zt | S) > ξT and ξ̇(Zt | S) = 0 then
            S ← S ∪ {Zt}
            ξT ← E[ξ(Si | S \ {Si})]
        end
        if |S| > k then
            discard_one(S)
        end
    until t ≥ n
    return S

Algorithm 11.4 OnlineSummaryWithReplacement(Z|k).
If we continuously select the images which are above the average surprise value of the previously selected images, then, assuming that the surprise scores are from a uniform random distribution, it can be shown that we will have infinitely many images as the time tends to infinity (Broder et al., 2008). Since we are interested in maintaining a finite number of images in the summary set, we must then come up with a strategy to discard a summary image when the size of the summary exceeds the maximum size. We can suggest two different discarding strategies, each of which leads to a different kind of summary:

(1) Discard oldest. We define the age of a summary image in terms of the time of the last observation which matched that summary image. Hence, if a summary image is regularly observed, it is kept in the summary. This ensures that we have images which correspond to the mean appearance of the world in our summary, since they are needed to identify the surprises. This strategy produces a summary which focuses on describing recent observations.

(2) Discard least surprising. We discard the summary image which is least surprising, given the remaining summary images. If Sr is the discarded summary image, then we have

    r = \arg\min_i \xi^*(S_i \mid S \setminus \{S_i\}).    (11.16)

Discarding the least surprising image is a good strategy if we would like a temporally global summary of all the observations seen so far.
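The following sketch combines the selection rule of Algorithm 11.4 with the "discard least surprising" strategy; for brevity it omits the local-maximum test on the surprise signal (the ξ̇ = 0 condition in the pseudocode), which is a simplification on our part:

    def wobegon_summary(stream, k, surprise):
        """Online summary with replacement (Algorithm 11.4, simplified sketch).

        stream: iterable of image descriptors, observed one at a time.
        k: maximum summary size.
        surprise: callable surprise(z, summary_list) -> float.
        """
        stream = iter(stream)
        summary = [next(stream)]
        threshold = 0.0
        for z in stream:
            if surprise(z, summary) > threshold:
                summary.append(z)
                # Recompute the threshold as the mean surprise of each summary
                # image relative to the rest of the summary.
                threshold = sum(
                    surprise(s, summary[:i] + summary[i + 1:])
                    for i, s in enumerate(summary)
                ) / len(summary)
            if len(summary) > k:
                # Discard least surprising: drop the summary image that is best
                # explained by the remaining summary images (Eq. (11.16)).
                r = min(
                    range(len(summary)),
                    key=lambda i: surprise(summary[i],
                                           summary[:i] + summary[i + 1:]),
                )
                del summary[r]
        return summary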
11.5.2 k-secretaries summaries: picking without replacement
We will now discuss a sampling strategy in which the image choices are irrevocable. Consider the strategy in Algorithm 11.5 for selecting k summary images from n observations. First, we initialize the summary by adding the first observed image to the summary set. Next, we observe the first r images without picking any, and set the surprise threshold ξT to the maximum surprise observed in this observation interval. Finally, we pick the first k − 1 images which exceed the surprise threshold.
    S ← {Z1}
    ξT ← max{ξ(Z2 | S), . . . , ξ(Zr | S)}
    t ← r
    repeat
        t ← t + 1
        if ξ(Zt | S) > ξT then
            S ← S ∪ {Zt}
        end
    until |S| ≥ k or t ≥ n
    return S

Algorithm 11.5 OnlineSummaryWithoutReplacement(Z|k).
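A sketch of Algorithm 11.5 in Python; the observation budget r is set here using the approximation derived below in Eq. (11.22), which is an assumption about how one would typically instantiate the algorithm:

    import math

    def secretaries_summary(stream, k, n, surprise):
        """Online summary without replacement (Algorithm 11.5, sketch).

        stream yields the n observations in order; choices are irrevocable.
        The observation budget r uses the approximation r ~ n / (k e^{1/k})
        derived later in the chapter (Eq. (11.22)).
        """
        stream = iter(stream)
        summary = [next(stream)]                       # S = {Z1}
        r = max(2, int(round(n / (k * math.exp(1.0 / k)))))
        # Calibration: the threshold is the largest surprise among Z2..Zr.
        threshold = max(surprise(next(stream), summary) for _ in range(r - 1))
        # Selection: pick, irrevocably, each image whose surprise exceeds it.
        for t, z in enumerate(stream, start=r + 1):
            if surprise(z, summary) > threshold:
                summary.append(z)
            if len(summary) >= k or t >= n:
                break
        return summary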
The main problem now is to find an optimal value of r which maximizes the probability of selecting the k most surprising images. This problem can be modeled as an extension of the classical secretary-hiring problem, which can be described as follows: "You are given the task of hiring the best possible candidate for the job of a secretary. There are n applicants who have signed up for the interview. After interviewing each candidate, you can rank them relative to all other candidates seen so far. You must either hire or reject the candidate immediately after the interview. You are not allowed to go back and hire a previously rejected candidate." Dynkin (1963) showed that if we observe the first r = n/e candidates to find the threshold and then choose the first candidate above this threshold, then we maximize the probability of finding the top candidate. In Algorithm 11.5, we are selecting k images instead, and we would like to adjust r accordingly.
The candidate score is the image surprise, which we assume to be uncorrelated with other images. Let the probability of success be Φ_k(r), where success is defined by the event that all of the top k highest-scoring candidate images have been selected. Without loss of generality, for the purpose of simplifying this analysis, we assume that we select k images which exceed the threshold, instead of k − 1. Let J_i be the event that, with the selection of the ith candidate, we have succeeded. We can then write

    \Phi_k(r) = P(\text{Success})    (11.17)
              = P\left( \bigcup_{i=r+k}^{n} J_i \right).    (11.18)
We ignore the first r candidates, since those candidates are never selected, by the definition of the algorithm, and then we can ignore the next k − 1 candidates, since it is impossible to select k candidates from k − 1 possibilities. We can then write Φ_k(r) as

    \Phi_k(r) = \sum_{i=r+k}^{n} P(J_i)    (11.19)
              = \sum_{i=r+k}^{n} \frac{k}{n} \cdot \frac{r}{i-1} \cdot \frac{\binom{i-r-1}{k-1}}{\binom{n}{k-1}}    (11.20)
              = \frac{k}{n} \cdot \frac{r}{\binom{n}{k-1}} \sum_{i=r+k-1}^{n-1} \frac{\binom{i-r}{k-1}}{i},    (11.21)
where \binom{n}{k} denotes a binomial coefficient. Let us examine the three components of Eq. (11.20). The first term, k/n, is the probability that the ith candidate is one of the top k candidates. The second term, r/(i − 1), is the probability that none of the previous candidates were the last of the top k selected candidates. The third term, \binom{i-r-1}{k-1} / \binom{n}{k-1}, is the probability that all of the remaining k − 1 candidates have been selected. By combining these terms, we get the probability of the event that we have successfully selected the last of the top k candidates. We are now interested in the value of r which maximizes Φ_k(r). A closed-form solution to this optimization problem is not known. However, by substituting for different values of k and using numerical methods, we have computed optimal values of r for different values of k and found the approximation

    r \approx \frac{n}{k\,e^{1/k}}.    (11.22)
This can be seen as a generalization of the classical secretary-hiring problem, since if we substitute k = 1 we get the same solution as for the classical problem.
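The quality of this approximation is easy to probe numerically. The following Monte Carlo sketch (our own illustration, using i.i.d. uniform scores as assumed in the analysis) estimates the probability that all of the top k scores are captured for a given r, and can be used to check that values of r near n/(k e^{1/k}) perform well:

    import math
    import random

    def success_probability(n, k, r, trials=20000):
        """Estimate P(all top-k scores are captured) for threshold parameter r.

        Scores are i.i.d. uniform: the first r scores only set the threshold,
        and afterwards the first k scores exceeding the threshold are picked
        irrevocably.
        """
        wins = 0
        for _ in range(trials):
            scores = [random.random() for _ in range(n)]
            threshold = max(scores[:r])
            picked = []
            for i in range(r, n):
                if scores[i] > threshold:
                    picked.append(i)
                    if len(picked) == k:
                        break
            top_k = set(sorted(range(n), key=scores.__getitem__)[-k:])
            wins += top_k.issubset(picked)
        return wins / trials

    # Compare the recommended budget with nearby values (example: n = 90, k = 3).
    n, k = 90, 3
    r_star = max(1, round(n / (k * math.exp(1.0 / k))))
    for r in (max(1, r_star // 2), r_star, min(n - k, 2 * r_star)):
        print(r, success_probability(n, k, r))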
We can now use Algorithm 11.5 to identify the most surprising images while maximizing the probability that those images are the top k most surprising.

11.6 Results
In this section, we show summaries generated using the four different strategies presented in this chapter, using two different data sets shown in Figures 11.6a and 11.7a. Figure 11.6a shows a set of 90 images taken over a simulated Mars terrain. The images show rocks of different shapes and sizes, and sand of different textures. We used Gabor response histograms with λ = 32, σ = 16, and four different orientations to describe the images. To achieve orientation invariance, we computed the surprise of an image relative to another image by considering four different relative orientations and then choosing the minimum surprise. A Gabor-based descriptor was chosen to emphasize the texture in the data set. Figure 11.7a shows images from a much larger data set, with over 7000 images, of which we show only 64 uniformly sampled images. The data was collected by an aerial vehicle flying over a coastal terrain, with a downward-looking camera. We used hue histograms to describe the images to facilitate understanding of the results.

11.6.1 Offline summaries
Figure 11.6b shows the summary generated using the extremum summary algorithm for the Mars data set. Figure 11.6c shows the summary generated using the k-cover summary algorithm. We set k = 4 and used a surprise threshold ξT = 0.05 to get a final coverage ratio γ = 0.99. For both offline algorithms, we initialized the summary with the image which had the smallest mean surprise relative to all other images. The extremum summary algorithm focuses on picking samples that are as different as possible from what is already in the summary, whereas the k-cover summary algorithm emphasizes picking images that describe the maximum number of undescribed (uncovered) images. This difference is highlighted by the third image obtained with the extremum summary algorithm. The white rock, although quite different from everything else, is not very common in the data set. It is hence not picked by the k-cover summary algorithm. Figure 11.7b shows the summary generated using the extremum summary algorithm for the aerial-view dataset. Figure 11.7c shows the summary generated using the k-cover summary algorithm with surprise threshold ξT = 0.1, to get a final coverage ratio γ = 0.98. Both summaries cover the visual diversity of the data set well and include images of the beach, buildings, bushes, and land.
(a) Input. (b) Offline extremum summary. (c) Offline k-cover summary.

Figure 11.6 Offline summary of a simulated Mars terrain data set. (a) Set of 90 images in the observation set. (b) Summary consisting of four images generated using Algorithm 11.1, which greedily chooses the most surprising images. (c) Summary consisting of four images generated using Algorithm 11.3, which greedily picks the images with maximum cover, i.e., images which describe the maximum number of other images. A color version of this figure can be found on the publisher's website (www.cambridge.org/9781107001756).
(a) Subsampled input. (b) Offline extremum summary. (c) Offline k-cover summary.

Figure 11.7 Offline summary of an aerial-view data set. (a) Set of 64 images, uniformly subsampled from the full data set consisting of over 7000 images. (b) Summary consisting of six images generated using Algorithm 11.1, which greedily chooses the most surprising images. (c) Summary consisting of six images generated using Algorithm 11.3, which greedily picks the images with maximum cover, i.e., images which describe the maximum number of other images. A color version of this figure can be found on the publisher's website (www.cambridge.org/9781107001756).
(a) Surprise and selection threshold over time. (b) Summary evolution.

Figure 11.8 Wobegon summaries – online picking with replacement. A succession of four-frame navigation summaries computed by the system at successive points in time for the Mars terrain data set. Each row depicts an intermediate navigation summary computed based on the (partial) data recorded as the robot captured additional frames. Time progresses downwards, with the top being the first result and the bottom row being the last. The variance in the appearance of the frames increases with time. The final summary, represented by the bottom row, consists of frames representing rocks and gravels of different textures. (See text for further details.) A color version of this figure can be found on the publisher's website (www.cambridge.org/9781107001756).
(a) Surprise and selection threshold over time. (b) Summary evolution.

Figure 11.9 Online summary (without replacement). A succession of navigation summaries computed by the k-secretaries summary algorithm for the Mars terrain data set with 90 images. (a) shows the surprise over time. The threshold is the maximum surprise value observed in the first n/(k e^{1/k}) = 21 images. We see four peaks above the threshold, out of which only the images corresponding to the first three were picked. (b) The summaries obtained. Compared with the summary generated with replacement allowed, we see that all major image types are still represented without the need for repeated refinement. A color version of this figure can be found on the publisher's website (www.cambridge.org/9781107001756).
The difference between the two algorithms is highlighted by the fifth choice. The extremum summary algorithm picks an image of a rapid, which is different from everything else and is very rare in the data set. The k-cover summary algorithm instead picks an area of land, which, although not very different from the third choice, covers many images in the full data set.
(a) Surprise and selection threshold over time. (b) Summary evolution.

Figure 11.10 Online summary (with replacement). A succession of six-frame navigation summaries computed by the system at successive points in time during the flight of an aerial robot. Each row depicts an intermediate navigation summary computed based on the (partial) data recorded as the robot captured additional frames. Time progresses downwards, with the top being the first result and the bottom row being the last. The variance in the appearance of the frames increases with time. The final summary, represented by the bottom row, consists of frames representing buildings, land, clear ocean, and ocean rapids. (See text for further details.) A color version of this figure can be found on the publisher's website (www.cambridge.org/9781107001756).
11.6.2 Online summaries
(a) Surprise and selection threshold over time. (b) Summary evolution.

Figure 11.11 Online summary (without replacement). A succession of navigation summaries computed by the k-secretaries summary algorithm. The threshold was determined by observing the first n/(k e^{1/k}) images, and then each image above the threshold surprise was picked irrevocably. Although we requested six images, only four images that exceeded the threshold surprise were selected by the algorithm. Compared with the summary generated with replacement allowed (shown in Figure 11.10b), we see that all major image types are still represented. A color version of this figure can be found on the publisher's website (www.cambridge.org/9781107001756).

Figures 11.8b and 11.10b show the evolution of the summary set over time for the Mars terrain and aerial-view data sets, respectively, using picking with replacement (Algorithm 11.4). Each successive row shows the state of the summary after a fixed time step. The final row is the summary after the last image was observed. We used the "discard least surprising" strategy to eliminate an image from the summary set when the set grew larger than the maximum specified size. Figures 11.8a and 11.10a show plots of the surprise of the incoming observations and the acceptance threshold over time. We see that initially, since the
threshold is low, we rapidly pick several images, and in the process, the threshold grows rapidly. This is also clear from Figure 11.10b, where we see that the initial rows are filled with similar looking images, which is the result of a low threshold. However, with time, we see the variance in the appearance of the images increasing rapidly. Figures 11.9b and 11.11b show the evolution of the summary over time obtained using the k-secretaries algorithm, which allows picking without replacement. We see that our choices have a high variance in appearance even though we have not resorted to repeated refinement of the choices.
11.7 Conclusions
We have looked at offline and online strategies for picking images which epitomize the appearance of the world as observed by a robot. For the offline case, we have looked at two different strategies for picking the summary images. The extremum summary algorithm recursively picks the images that are the most surprising, given the current summary, and then adds them to the current summary. The set cover summary algorithm, instead of picking the most surprising images, emphasizes the picking of the images that describe the maximum number of observations (cover). When one is picking images online, it is not possible to pick the images with the maximum cover, since computing the cover requires knowing all observations. Hence, we have focused on picking the images that are the most surprising. We have presented two different strategies. The Wobegon summary algorithm, inspired by the Lake Wobegon sampling strategy, simply picks images which are above the mean surprise of the images in the summary, replacing a previous choice if needed. The k-secretaries summary algorithm, inspired by the classical secretary-hiring problem, allows images to be picked without replacement. We have shown how to compute a theoretically optimal surprise threshold which maximizes the probability of selecting all of the k most surprising observations.

References

Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. (2008). SURF: speeded up robust features. Comput. Vis. Image Understand., 110: 346–359.
Bourque, E. and Dudek, G. (2000). On the automated construction of image-based maps. Auton. Robots, 8: 173–190.
Broder, A. Z., Kirsch, A., Kumar, R., Mitzenmacher, M., Upfal, E., and Vassilvitskii, S. (2008). The hiring problem and Lake Wobegon strategies. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), San Francisco, CA, pp. 1184–1193.
Chang, H. S., Sull, S., and Lee, S. U. (1999). Efficient video indexing scheme for content-based retrieval. IEEE Trans. Circuits Syst. Video Technol., 9: 1269–1279.
Chvatal, V. (1979). A greedy heuristic for the set-covering problem. Math. Oper. Res., 4: 233–235.
Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Am. A, 2: 1160–1169.
Dudek, G. and Lobos, J.-P. (2009). Towards navigation summaries: automated production of a synopsis of a robot trajectories. In Proceedings of the Canadian Conference on Computer and Robotic Vision (CRV), Kelowna, BC, pp. 93–100.
Dynkin, E. B. (1963). The optimum choice of the instant for stopping a Markov process. Sov. Math. Dokl., 4: 627–629.
Girdhar, Y. and Dudek, G. (2009). Optimal online data sampling or how to hire the best secretaries. In Proceedings of the 2009 Canadian Conference on Computer and Robot Vision, Kelowna, BC, pp. 292–298.
Girdhar, Y. and Dudek, G. (2010a). Online navigation summaries. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Anchorage, AK, pp. 5035–5040.
Girdhar, Y. and Dudek, G. (2010b). ONSUM: a system for generating online navigation summaries. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Taipei.
Gong, Y. and Liu, X. (2000). Video summarization using singular value decomposition. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Hilton Head, SC, pp. 174–180.
Grigorescu, S. E., Petkov, N., and Kruizinga, P. (2002). Comparison of texture features based on Gabor filters. IEEE Trans. Image Process., 11: 1160–1167.
Itti, L. and Baldi, P. (2009). Bayesian surprise attracts human attention. Vis. Res., 49: 1295–1306.
Jain, A. K. and Farrokhnia, F. (1991). Unsupervised texture segmentation using Gabor filters. Pattern Recogn., 24: 1167–1186.
Karp, R. M. (1972). Reducibility among combinatorial problems. In R. E. Miller and J. W. Thatcher (eds.), Complexity of Computer Computations, pp. 85–103. New York: Plenum.
Konolige, K., Bowman, J., Chen, J. D., Mihelich, P., Calonder, M., Lepetit, V., and Fua, P. (2009). View-based maps. In Proceedings of Robotics: Science and Systems, Seattle, WA.
Kullback, S. (1959). Information Theory and Statistics. New York: Wiley.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis., 60: 91–110.
Ngo, C.-W., Ma, Y.-F., and Zhang, H.-J. (2005). Video summarization and scene detection by graph modeling. IEEE Trans. Circuits Syst. Video Technol., 15: 296–305.
Ranganathan, A. and Dellaert, F. (2009). Bayesian surprise and landmark detection. In Proceedings of the 2009 IEEE International Conference on Robotics and Automation (ICRA), Kobe, Japan, pp. 1240–1246.
Sivic, J. and Zisserman, A. (2003). Video Google: a text retrieval approach to object matching in videos. In Proceedings of the International Conference on Computer Vision (ICCV), Nice, France, pp. 1470–1477.
Sivic, J. and Zisserman, A. (2006). Video Google: efficient visual search of videos. In J. Ponce, M. Hebert, C. Schmid, and A. Zisserman (eds.), Toward Category-Level Object Recognition, pp. 127–144. Lecture Notes in Computer Science, 4170. Berlin: Springer.
Slavík, P. (1997). A tight analysis of the greedy algorithm for set cover. J. Algorithms, 25: 237–254.
Turner, M. R. (1986). Texture discrimination by Gabor functions. Biol. Cybern., 55: 71–82.
PART III NATURAL-SCENE PERCEPTION
12 Making a scene in the brain

Russell A. Epstein and Sean P. MacEvoy

12.1 Introduction
Human observers have a remarkable ability to identify thousands of different things in the world, including people, animals, artifacts, structures, and places. Many of the things we typically encounter are objects – compact entities that have a distinct shape and a contour that allows them to be easily separated from their visual surroundings. Examples include faces, blenders, automobiles, and shoes. Studies of visual recognition have traditionally focused on object recognition; for example, investigations of the neural basis of object and face coding in the ventral visual stream are plentiful (Tanaka, 1993; Tsao and Livingstone, 2008; Yamane et al., 2008). Some recognition tasks, however, involve analysis of the entire scene rather than just individual objects. Consider, for example, the situation where one walks into a room and needs to determine whether it is a kitchen or a study. Although one might perform this task by first identifying the objects in the scene and then deducing the identity of the surroundings from this list, this would be a relatively laborious process, which does not fit with our intuition (and behavioral data) that we can identify the scene quite rapidly. Consider as well the challenge of identifying one's location during a walk around a city or a college campus, or through a natural wooded environment. Although we can perform this task by identifying distinct object-like landmarks (buildings, statues, trees, etc.), we also seem to have some ability to identify places based on their overall visual appearance.
These observations suggest that our visual system might have specialized mechanisms for perceiving and recognizing scenes that are distinct from the mechanisms for perceiving and recognizing objects. There are many salient differences between scenes and objects that could lead to such specialization. Some of these differences are structural. As noted above, objects are spatially and visually compact, with a well-defined contour. In contrast, scenes are spatially and visually distributed, containing both foreground objects and fixed background elements (walls, ground planes, and distant topographical features). Consequently, we tend to speak of the shape of an object but the layout of a scene. Related to these structural differences are differences in ecological relevance. As a consequence of their compactness, objects are typically things that one can move – or, at least, imagine moving – to different locations in the world. In contrast, scenes are locations in the world; thus, it makes little sense to think of "moving" a scene. Put another way, objects are things one acts upon; scenes are things one acts within. Because of this, scenes are relevant to navigation in a way that objects typically are not. In this chapter, we will review the evidence for specialized scene recognition mechanisms in the human visual system. We will begin by reviewing behavioral literature that supports the contention that scene recognition involves analysis of whole-scene features such as scene layout and hence might involve processes that are distinct from those involved in object recognition. We will then discuss evidence from neuroimaging and neuropsychology that implicates a specific region in the human occipitotemporal lobe – the parahippocampal place area (PPA) – in processing of whole-scene features. We will describe various data analysis techniques that facilitate the use of functional magnetic resonance imaging (fMRI) as a tool to explore the representations that underlie scene processing in the PPA. Finally, we will bring objects back into the picture by describing recent experiments that explore how objects are integrated into the larger visual array.
Efficacy of human scene recognition
The first evidence for specialized mechanisms for scene perception came from a classic series of studies by Potter and colleagues (Potter, 1975, 1976; Potter and Levy, 1969) showing that human observers can interpret complex visual scenes with remarkable speed and efficiency. Subjects in these experiments were asked to detect the presence of a target scene within a sequential stream of distractors. Each scene in the stream was presented for a brief period of time (e.g., 125 ms) and then immediately replaced by the next item in the sequence, a technique known as rapid serial visual presentation. Despite the challenging nature of this task, the subjects were remarkably efficient,
Making a scene in the brain detecting the target on 75% of the trials. This high level of performance was observed even when the target scene was defined only by a high-level verbal description (e.g., “a picnic” or “two people talking”) so that subjects could not know exactly what it would look like. Potter concluded that the subjects could not have been doing the task based on simple visual feature mapping: they must have processed the scene up to the conceptual level. The human visual system appears to be able to extract the gist (i.e., overall meaning) of a complex visual scene within 100 ms. Related results were obtained by Biederman (1972), who observed that recognition of a single object within a briefly flashed (300–700 ms) scene was more accurate when the scene was coherent than when it was scrambled (i.e., jumbled up into pieces). Biederman concluded that the human visual system can extract the meaning of a complex visual scene within a few hundred milliseconds and use it to facilitate object recognition. A notable aspect of this experiment is the fact that the visual elements of the scene were present in both the intact and the scrambled conditions. Thus, it is not the visual elements alone that affect object recognition, but the organization into a meaningful scene. In other words, the layout of the scene matters, lending support to the notion of a scene recognition system that is distinct from and might influence object recognition. Subsequent behavioral work has upheld the proposition that even very complex visual scenes can be interpreted very rapidly (Antes et al., 1981; Biederman et al., 1974; Fei-Fei et al., 2007; Thorpe et al., 1996). Although one might argue that scene recognition in these earlier studies reduces simply to recognition of one or two critical objects within the scene, subsequent work has provided evidence that this is not the whole story. Scenes can be identified based on their whole-scene characteristics, such as their overall spatial layout, without reducing them to their constituent objects. Schyns and Oliva (1994) demonstrated that subjects could classify briefly flashed (30 ms) scenes into categories (highway, city, living room, and valley) even if the scenes were filtered to remove all high-spatial-frequency information, leaving only an overall layout of coarse blobs, which conveyed little information about individual objects. Computational modeling work has given further credence to this idea by demonstrating that human recognition of briefly presented scenes could be simulated by recognition systems that operated solely on whole-scene characteristics. For example, Renniger and Malik (2004) developed an algorithm that classified scenes from 10 different categories based solely on their visual texture statistics. The performance of the model was comparable to that of humans attempting to recognize briefly presented (37 ms) scenes. Then Greene and Oliva (2009) developed a scene recognition model that operated on seven global properties: openness, expansion, mean depth, temperature, transience,
257
258
R. A. Epstein and S. P. MacEvoy concealment, and navigability. These properties predicted the performance of human observers, insofar as scenes that were more similar in the property space were more often misclassified by the observers. Results such as these provide an “existence proof” that it is possible to identify scenes on the basis of global properties rather than by first identifying the component objects. Indeed, the similarity of the human and model error patterns in Greene and Oliva’s study strongly suggests that we actually use these global properties for recognition. In other words, we have the ability to identify the scenic “forest” without having to first represent the individual “trees.” Note that this does not mean that scenes are never recognized based on their component objects (we discuss this scenario later). However, it does argue for the existence of mechanisms that can process these scene-level properties. A final line of behavioral evidence for specialized scene-processing mechanisms comes from the phenomenon of boundary extension (BE). When subjects are shown a photograph of a scene and are later asked to recall it, they tend to remember it as being more wide-angle than it actually was – as if its boundaries had been extended beyond the edges of the photograph (Intraub et al., 1992). Although it was initially thought to be a constructive error, more recent results indicate that BE occurs even when the scene has only been absent for an interval as short as 42 ms, suggesting that it is not an artifact of memory but indicative of the scene representations formed during online perception (Intraub and Dickinson, 2008). Specifically, a representation of the layout of the local environment may be formed during scene viewing that is more expansive than the portion of the scene shown in the photograph. When the photograph is removed, this layout representation remains, influencing one’s memory of the width of the scene. BE occurs only for scenes: it is not found after viewing decontextualized objects. Thus, BE provides another line of evidence for distinct scene- and object-processing systems because it indicates the existence of a special representation for scenic layouts.
12.3
Scene-processing regions of the brain
What are the neural systems that support our ability to recognize scenes? The first inkling that there might be specialized cortical territory for scene recognition came from a 1998 fMRI paper (Epstein and Kanwisher, 1998). Subjects were scanned with fMRI while they viewed indoor and outdoor scenes along with pictures of common objects (blenders, animals, and tools) and faces. A region in the collateral sulcus near the parahippocampal–lingual boundary responded much more strongly to the scenes than to the other stimuli
Making a scene in the brain
Figure 12.1 Scene-responsive cortical regions. Subjects were scanned with fMRI while viewing scenes and nonscene objects. The parahippocampal place area (PPA) and retrosplenial complex (RSC) respond more strongly to the scenes than to the objects (highlighted voxels). The PPA straddles the collateral sulcus near the parahippocampal–lingual boundary. The RSC is in the medial parietal region and extends into the parietal–occipital sulcus. Scene-responsive territory is also observed in the transverse occipital sulcus region (not shown). A color version of this figure can be found on the publisher’s website (www.cambridge.org/9781107001756).
(Figure 12.1). This region was labeled the parahippocampal place area, or PPA, because it responded preferentially to images of places (i.e., scenes). These results were similar to contemporaneous findings that the parahippocampal– lingual region responds more strongly to houses and buildings than to faces (Aguirre et al., 1998; Ishai et al., 1999), which makes sense if one considers the fact that the facade of a building is a kind of partial scene. Subsequent work demonstrated that the PPA responds to a wide range of scenes, including cityscapes, landscapes, rooms, and tabletop scenes (Epstein et al., 2003; Epstein and Kanwisher, 1998). It even responds strongly to “scenes” made out of Lego blocks (Epstein et al., 1999). This last result is particularly interesting because the comparison condition was objects made out of the same Lego materials but organized as a compact object rather than a distributed scene (Figure 12.2a). Thus, the geometric structure of the stimulus appears to play an important role in determining whether the PPA interprets it as a “scene” or an “object.” We refer to the idea that the PPA responds to the overall structure of a scene as the layout hypothesis. Some of the strongest evidence for this idea comes from a study in which subjects viewed rooms that were either filled with furniture and other potentially movable objects or emptied such that they were just bare walls (Figure 12.2b). In a third condition, the objects from the rooms were displayed in a multiple-item array on a blank background. The PPA responded strongly to the rooms irrespective of whether they contained objects or not. In contrast, the response to the decontextualized objects was much lower. This finding suggests that the PPA responds strongly to layout-defining background elements but may
259
260
R. A. Epstein and S. P. MacEvoy (a)
Lego scene
Lego objects
(b)
Scene
Backgrounds
Objects
Figure 12.2 The PPA encodes scenic layout. The magnitude of the PPA’s response to each of a series of scenes is shown in a bar chart on the right. (a) The PPA responds more strongly to “scenes” made out of Lego blocks than to “objects” made out of the same materials, indicating that the spatial geometry of the stimulus affects the PPA response. (b) The PPA responds almost as strongly to background elements alone as it does to scenes containing both foreground objects and background elements; in contrast, the response to the foreground objects alone is much lower. Note that although the locations of the objects were randomized in this experiment, additional studies indicated that the response in the object-alone condition was low even when the spatial arrangement was preserved from the original scene (Epstein and Kanwisher, unpublished data). The PPA was identified in each subject as the set of voxels in the collateral sulcus region that responded more strongly to real-world scenes than to real-world objects in a separate data set. A color version of this figure can be found on the publisher’s website (www.cambridge.org/9781107001756).
play little role in processing the discrete objects within the scene. Originally, we (Epstein and Kanwisher, 1998) hypothesized that the PPA might respond solely to spatial layout – the geometric structure of the scene as defined by these large surfaces. Behavioral data suggest that humans and animals preferentially use such geometric information during reorientation (Cheng, 1986; Cheng and Newcombe, 2005; Hermer and Spelke, 1994); thus, we speculated that the PPA might be the neural locus of a geometry-based reorientation module (Gallistel, 1990). However, the idea that the PPA responds solely to scene geometry while ignoring nongeometric visual features such as color or texture has not been proven, and indeed there is some evidence against this claim (Cant and Goodale, 2007).
Making a scene in the brain In general, though, the neuroimaging data suggest that the PPA plays an important role in processing global aspects of scenes rather than the objects that make them up. (However, see Bar (2004) for a different view.) This conclusion is consistent with neuropsychological data, which support the proposed role of the posterior parahippocampal/anterior lingual region in scene recognition. As one might expect from the neuroimaging literature, the ability to recognize scenes is often seriously degraded following stroke damage to this region. This syndrome is sometimes referred to as topographical agnosia because the patient has difficulty identifying a wide variety of large topographical entities, such as buildings, monuments, squares, vistas, and intersections (Mendez and Cherrier, 2003; Pallis, 1955). Interestingly, this deficit is usually reported as an inability to identify specific places. To our knowledge, a deficit in being able to recognize scenes at the categorical level (e.g., beach vs. forest) is not spontaneously reported, perhaps because there are alternative processing pathways for category analysis that do not require use of the PPA. Topographical agnosia patients often complain that some global organizing aspect of the scene is missing (Epstein et al., 2001; Habib and Sirigu, 1987); for example, a patient described by Hécaen (1980) reported that “it’s the whole you see, in a very large area I can identify some minor detail or places but I don’t recognize the whole.” This inability to perceive scenes as unified entities is also evident in a compensatory strategy that is often observed: rather than identifying places in a normal manner, topographical agnosia patients will often focus on minor visual details such as a distinctive mailbox or door knocker (Aguirre and D’Esposito, 1999). An important contrast to these patients is provided by DF, a patient whose PPA is preserved but whose object-form-processing pathway (including the lateral occipital complex (LOC)) is almost completely obliterated. Despite an almost total inability to recognize objects, DF is able to classify scenes in terms of general categories such as city, beach, or forest; furthermore, her PPA activates during scene viewing (Steeves et al., 2004). These results demonstrate a double dissociation between scene and object processing: the PPA appears to be part of a distinct processing stream for scenes that bypasses the more commonly studied object-processing pathway. This contention is also supported by anatomical connectivity studies (Kim et al., 2006). In sum, the PPA appears to play a critical role in scene recognition. Note, however, that this does not mean that it is the only region involved. At least two other regions respond more strongly to scenes than to other stimuli: a retrosplenial/medial parietal region that has been labeled the retrosplenial complex (RSC; Figure 12.1) and an area near the transverse occipital sulcus (TOS). We have published several studies exploring the idea that the PPA and RSC may play distinct roles in scene processing, with the PPA more concerned with scene
261
262
R. A. Epstein and S. P. MacEvoy perception/recognition and the RSC more involved in linking the local scene to long-term topographical memory stores (Epstein and Higgins, 2007; Epstein et al., 2007b). In this chapter, we focus primarily on the PPA because of its critical role in visual recognition. A perhaps even more important point, however, is to remember that a region does not have to respond preferentially to scenes to play an important role in scene recognition. Although it might be possible to recognize scenes based on whole-scene characteristics such as layout, they might also be identified in part through analysis of their constituent objects, in which case object-processing regions such as the Lateral Occipital Complex (LOC; Malach et al., 1995) and fusiform gyrus might be involved. We will explore this idea later in this chapter. 12.4
Probing scene representations with fMRI
What is the nature of the scene representations encoded by the PPA? Broadly speaking, we can imagine at least three very different ways in which the PPA might encode scenes. First, the PPA might encode the “shape” or “geometry” of the scene as defined primarily by the large bounding surfaces. In this view, the representations supported by the PPA would be inherently three-dimensional, insofar as they code information about the locations of surfaces, boundaries, affordances, and openings in the scene. Second, the PPA might encode a “visual snapshot” of what scenes look like from particular vantage points. In this case, the PPA representation would be inherently two-dimensional, insofar as the coded material would be the distribution of visual features across the retina rather than the locations of surfaces in three-dimensional space. Finally, the PPA might encode a spatial coordinate frame that is anchored to the scene but does not include details about scene geometry (Shelton and Pippitt, 2007). In this case, the geometry of the scene would be processed, but only up to the point at which it is possible to determine the principal axis of the environment and to determine the observer’s orientation relative to this axis. Such a spatial code might be less useful for identifying the scene as a particular place or type of place, but might be ideal for distinguishing between different navigational situations. To help differentiate among these scenarios, an important question is whether the pattern of activation evoked in the PPA by a given scene changes with the observer’s viewpoint. That is, are two different views of the scene coded the same or differently? In the first scenario, the scene geometry could be could be defined relative to either a viewer-centered or a scene-centered axis (Marr, 1982). In the second scenario, we would expect PPA scene representations to be entirely viewpoint-dependent because the distribution of visual features on
Making a scene in the brain the retina necessarily changes with viewpoint. In the third scenario, one would also expect some degree of viewpoint-dependent coding, but the PPA might be more sensitive to some viewpoint changes than to others. For example, it might be more sensitive to viewpoint changes caused by changes in orientation but less sensitive to viewpoint changes caused by changes in position in which the orientation relative to the scene remains constant (Park and Chun, 2009). How can we address such a question with neuroimaging? Although the precise way that cognitive “representations” relate to neural activity is unclear (Gallistel and King, 2009), it is usually assumed that two items are representationally distinct if they cause different sets of neurons to fire. For example, neurons in area MT are tuned for motion direction and speed. As not all neurons have the same tuning, stimuli moving in different directions activate different sets of neurons in this region. Ideally, we would like to know whether two views of the same scene activate the same or different neuronal sets in the PPA. Although the spatial resolution of fMRI is far too coarse to address this question directly, two data analysis methods have been used to get at the question indirectly. The first method is multivoxel pattern analysis (MVPA). The basic unit of fMRI data is the voxel (volume element), which can be though of as the three-dimensional equivalent of a pixel. Each fMRI image typically consists of thousands of such voxels, each corresponding to a cube of brain tissue that typically measures 2–6 mm on a side. For example, a region such as the PPA might include from several dozen to several hundred 3 × 3 × 3 mm voxels. In standard fMRI analysis, one averages the responses of these voxels together. This gives one an estimate of how much the region as a whole activates in response to each condition. Although such univariate analyses are very useful for determining the kind of stimulus the region prefers (e.g., scenes but not faces in the case of the PPA), they are less useful for determining representational distinctions within the preferred stimulus class (e.g., whether the PPA distinguishes between forests and beaches). MVPA gets around this problem by doing away with the averaging. The voxel-by-voxel response pattern is treated as a multidimensional vector and various tests are done to determine whether the response vectors elicited by one condition are reliably different from the response vectors elicited by another. For example, in the paper that first popularized this technique, Haxby et al. (2001) used MVPA to demonstrate that the ventral occipitotemporal cortex distinguished between eight object categories (faces, houses, cats, scissors, bottles, shoes, chairs, and scrambled nonsense patterns). A more recent study by Walther et al. (2009) used MVPA to demonstrate that the PPA, the RSC, and the lateral occipital object-sensitive region can reliably distinguish between
263
264
R. A. Epstein and S. P. MacEvoy scene categories such as beaches, buildings, forests, highways, mountains, and industrial scenes. In theory, it should be possible to use MVPA to investigate the issue of viewpoint sensitivity. In particular, if different views of the same scene were to activate distinguishable patterns in the PPA, this would establish viewpoint specificity. However, a negative result would not necessarily demonstrate viewpoint-invariant coding (at least, not at the single-neuron level). The extent to which MVPA can be used to interrogate representations that are organized at spatial scales much smaller than a voxel is a matter of considerable debate (Drucker and Aguirre, 2009; Kamitani and Tong, 2005; Sasaki et al., 2006). One could imagine a scenario in which individual PPA neurons are selective for different views of a scene, but in which these neurons are tightly interdigitated within each voxel such that it is impossible to distinguish between different views of a scene using MVPA. In this scenario, MVPA would not reveal the underlying neural code. It would, however, reveal aspects of the representation that are implemented at a coarser (supraneuronal) spatial scale; for example, whether neurons responding to different views of the same scene (or scene category) are clustered together. The second method for addressing such representational questions is fMRI adaptation (sometimes also referred to as fMRI attenuation or fMRI repetition suppression). Here one looks at the reduction in fMRI response when a stimulus is repeated (Grill-Spector et al., 2006; Grill-Spector and Malach, 2001). The critical question is whether response reduction occurs when the repeated stimulus is a modified version of the original. If it does, one infers that the original stimulus and the modified stimulus are representationally similar. For example, one might examine whether a previously encountered scene elicits a response reduction if shown from a previously unseen viewpoint. If so, we conclude that the representation of the scene is (at least partially) viewpoint-invariant. This technique is motivated by neurophysiological findings indicating that neurons in many regions of the brain respond more strongly to the first presentation of a stimulus than to later presentations (Miller et al., 1993), as well as by behavioral studies indicating that experience with a stimulus can lead to an adapted detection threshold (Blakemore and Nachmias, 1971). In a typical implementation of an fMRI adaptation paradigm, two stimuli are presented within an experimental trial, separated by interval of 100–700 ms (Kourtzi and Kanwisher, 2001). For example, the two images could be different scenes, different views of the same scene, or the same view of the same scene (i.e., identical images). In this example, the “different-scene” condition is taken to be the baseline; that is, the condition for which there is no adaptation. The question of interest is then whether the response is reduced compared with
Making a scene in the brain this baseline when a scene is repeated from a different viewpoint (implying some degree of viewpoint invariance) or reduced only when a scene is repeated from the same viewpoint (implying viewpoint specificity). Across several experiments, we have consistently found that adaptation using this paradigm is viewpoint-specific (Epstein et al., 2003, 2005, 2007a, 2008). No (or very little) response reduction is observed in the same-scene/different-view condition, suggesting that, at least for large enough viewpoint changes, two views of the same scene are as representationally distinct as two different scenes. This would seem to resolve the issue. However, somewhat puzzlingly, one can get a somewhat different answer by implementing the fMRI adaptation paradigm in a slightly different way. Rather than repeating stimuli almost immediately within an experimental trial, one can repeat stimuli over longer intervals (several seconds or minutes) during which many other stimuli are observed (Henson, 2003; Vuilleumier et al., 2002). In this case, one treats the response to a previously unviewed scene (“new scene”) as the baseline and examines whether the response is lower for scenes that were previously viewed from a different viewpoint (“new view”) or reduced only for scenes that were previously viewed from the same viewpoint (“old view”). This gives a somewhat different result. Significant response reductions are observed for new views compared with new scenes, suggesting some generalization of processing across views (Epstein et al., 2007a, 2008). The adaptation effect is not entirely viewpoint-invariant, as we observe a further reduction of the response for old views compared with new views; nevertheless, the pattern is quite different from that observed with the short-interval repetition paradigm. What are we to make of this discrepancy? The short-interval and long-interval fMRI adaptation paradigms appear to be indexing different aspects of scene processing. To verify that these effects were indeed distinct, we implemented an experiment in which both kinds of fMRI adaptation could be measured simultaneously (Epstein et al., 2008). Subjects were scanned while viewing scene images that were paired together to make three kinds of short-interval repetition trials (different scene, same-scene/different-view, and same-scene/same-view). Critically, the subjects had been familiarized with some but not all of the scene images before the scan session (see Figure 12.3a). This allowed us to cross the three short-interval (i.e., within-trial) repetition conditions with three long-interval repetition conditions (new scene, new view, and old view) to give nine trial types in which the short-interval and long-interval repetition states were independently defined. Thus, for example, different-scene trials could be constructed from new scenes, from new views, and from old views. When we measured the fMRI response to these nine conditions, we observed a complete lack of interaction between the short-interval and long-interval
Figure 12.3 Long-interval and short-interval fMRI adaptation in the PPA. (a) Design of experiment. Subjects studied 48 scene images (two views each of 24 campus locations) immediately prior to the scan. The stimuli shown during the scan were either the study images (“old views”), previously unseen views of the studied locations (“new views”), or locations not presented in the study session (“new place”). Long-interval adaptation was examined by comparing fMRI response across these three conditions. Short-interval adaptation, on the other hand, was examined by measuring the effect of repeating items within a single experimental trial (different place, same place/different view, and same place/same view). The three short-interval repetition conditions were fully crossed with the three long-interval repetition conditions to give nine conditions in total, five of which are illustrated in the figure. (b) Results. Short-interval adaptation is almost entirely viewpoint-specific: response reduction is observed only when scenes are repeated from the same viewpoint. In contrast, long-interval adaptation is much more viewpoint-invariant: repetition reduction is observed even when scenes are presented from a different viewpoint. These data argue for distinct mechanisms underlying the two adaptation effects. A color version of this figure can be found on the publisher’s website (www.cambridge.org/9781107001756).
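The MVPA approach described above can be made concrete with a minimal sketch in Python: a split-half correlation classifier of the kind popularized by Haxby et al. (2001), applied to simulated voxel patterns. This is our illustration rather than the analysis code of any study discussed here; the region size, noise level, and condition labels ("beach", "forest") are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_voxels = 200                      # voxels in a hypothetical region of interest
templates = {"beach": rng.normal(size=n_voxels),    # idealized mean pattern per condition
             "forest": rng.normal(size=n_voxels)}

def measure(condition, noise=1.0):
    """Simulate one noisy multivoxel response pattern for a condition."""
    return templates[condition] + noise * rng.normal(size=n_voxels)

# Split the data into two independent halves (e.g., odd and even scan runs).
train = {c: np.mean([measure(c) for _ in range(10)], axis=0) for c in templates}
test_patterns = [(c, measure(c)) for c in templates for _ in range(20)]

def classify(pattern):
    """Assign the label whose training pattern correlates best with the test pattern."""
    corrs = {c: np.corrcoef(pattern, train[c])[0, 1] for c in train}
    return max(corrs, key=corrs.get)

accuracy = np.mean([classify(p) == c for c, p in test_patterns])
print(f"decoding accuracy: {accuracy:.2f}")   # well above the 0.5 chance level

Above-chance classification of this kind is what licenses the inference that the two conditions evoke reliably different voxelwise response patterns, even when the regional mean response does not differ.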
repetition effects, suggesting that they are caused by different underlying mechanisms. Furthermore, in a manner consistent with our earlier results, the short-interval adaptation effects were completely viewpoint-specific, while the long-interval adaptation effects showed a high degree of viewpoint invariance (Figure 12.3b). The fact that long-interval adaptation effects are more viewpoint-invariant than short-interval adaptation effects suggests that the long-interval effects might be initiated somewhere later in the processing stream. One possibility is that long-interval adaptation effects in the PPA might be caused by top-down modulation from other cortical regions. Indeed, long-interval effects are typically more prominent in areas involved in “mnemonic” processing such as the lateral frontal lobes (Buckner et al., 1998; Maccotta and Buckner, 2004) and the hippocampus (Epstein et al., 2008; Gonsalves et al., 2005), while short-interval effects are typically more prominent in early visual regions (Epstein, 2008; Ganel et al., 2006). Furthermore, transcranial magnetic stimulation (TMS) studies have linked the behavioral priming effects normally associated with long-interval adaptation to processing in the frontal lobes (Wig et al., 2005). Thus, the idea that long-interval adaptation reflects viewpoint-invariant representations coded outside the PPA, whereas short-interval adaptation effects reflect viewpointspecific representations inherent to the PPA, has some appeal. However, it is worth noting that whole-brain analyses revealed no clear high-level source for the long-interval repetition effect. Thus, at present, the hypothesis that short- and long-interval adaptation originate in distinct cortical regions must be considered speculative. Functional connectivity analyses might be useful for investigating this hypothesis further. A second possibility is that short- and long-interval adaptation effects reflect two separate mechanisms that are both anatomically localizable to the PPA. For
R. A. Epstein and S. P. MacEvoy example, short-interval fMRI adaptation might reflect short-term changes in the synaptic inputs to the PPA, perhaps caused by synaptic depression (Abbott et al., 1997), which is believed to operate on a relatively short timescale of less than 2 s (Muller et al., 1999). Long-interval adaptation, on the other hand, might reflect more permanent changes in interregional connectivity (Wiggs and Martin, 1998). Under this scenario, the PPA takes visual inputs in which different views of the same scene are representationally distinct and calculates a new representation in which different views of the same scene are representationally similar. This new representation could then be used as the basis for scene recognition. Although the interpretation of the fMRI adaptation data in terms of two different adaptation mechanisms is somewhat speculative, it is consistent with recent neurophysiological data. In an intriguing single-unit recording study, Sawamura et al. (2006) recorded from object-sensitive neurons in the monkey inferior temporal (IT) cortex and examined the adaptation effect when identical object images were repeated after a short interval (300 ms). In a second condition, they measured the cross-adaptation effect caused by presenting two different objects in sequence. Critically, these two different objects were chosen such that the neurons responded equally strongly to both in the absence of adaptation. Despite the fact that the neuron considered these objects to be “the same” in terms of firing rate, the objects were distinguishable in terms of adaptation. Specifically, there was more adaptation when the same object was presented twice than when two “representationally identical” objects were presented in sequence. Sawamura et al. hypothesized that these repetition reductions might reflect adaptation at the synaptic inputs to the neuron, which would be nonoverlapping for the two objects. In this scenario, short-interval fMRI adaptation may reflect the selectivity of the neurons that provide input to a region rather than reflecting the selectivity of the neurons within that region itself.1
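The logic of this interpretation can be illustrated with a toy simulation: a single output unit sums two non-overlapping sets of synaptic inputs, one driven by object A and one by object B, and adaptation acts only on the inputs that were just driven. The weights, gains, and adaptation factor below are arbitrary illustrative choices, not parameters estimated from the neurophysiological data.

import numpy as np

n_inputs = 10
weights = np.ones(n_inputs)
inputs_for = {"A": np.arange(0, 5), "B": np.arange(5, 10)}  # non-overlapping input sets
gain = np.ones(n_inputs)            # multiplicative gain of each synaptic input

def respond(obj, adapt_factor=0.5):
    """Output firing = weighted sum of active inputs; driving an input then adapts it."""
    drive = np.zeros(n_inputs)
    drive[inputs_for[obj]] = 1.0
    response = np.sum(weights * gain * drive)
    gain[inputs_for[obj]] *= adapt_factor   # synaptic depression of the inputs just used
    return response

gain[:] = 1.0
r_same = [respond("A"), respond("A")]   # same object twice: second response reduced
gain[:] = 1.0
r_diff = [respond("B"), respond("A")]   # equally effective but different object first
print(r_same, r_diff)                   # e.g., [5.0, 2.5] versus [5.0, 5.0]

Although A and B drive the output unit identically, repetition of A adapts the response while the A-after-B sequence does not, mirroring the dissociation reported by Sawamura et al. (2006).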
1
This scenario might also explain a recent failure to find orientation-specific adaptation in V1 using the short-interval repetition paradigm (Boynton and Finney, 2003). The fact that V1 neurons are tuned for the orientation of lines and gradients has been well established using direct recording methods (Hubel and Wiesel, 1962). Thus, the absence of orientation-specific adaptation when two gratings were presented within a trial was surprising. A subsequent experiment replicated this finding (Fang et al., 2005) and also demonstrated that orientation-specific adaptation could be observed if a different adaptation paradigm was used in which adapting gratings were presented for several seconds in order to induce neural fatigue (Fang et al., 2005). One possible explanation for these discrepant results is that short-interval adaptation effects might be dominated by suppression in the thalamic inputs to V1 which do not contain information about grating orientation. A later experiment by the same group observed similar results for faces within the fusiform face area (Fang et al., 2007). The “fatigue” paradigm
Making a scene in the brain Note that, even under this second scenario, individual neurons in the PPA might not exhibit viewpoint-invariant responses. Although long-term adaptation effects show a much greater tolerance for viewpoint changes than do short-interval adaptation effects, they retain some degree of viewpoint specificity. Furthermore, it is possible that long-interval effects might reflect changes in network connectivity that do not relate directly to the tuning of individual neurons. For example, one could imagine that neurons corresponding to distinct views of a particular building might be intermixed within a cortical column. If long-interval adaptation operates on columns rather than neurons, one would observe viewpoint-invariant adaptation even though the individual neurons within the column were viewpoint-tuned. Indeed, a likely scenario is that PPA neurons are analogous to the object-sensitive neurons in the monkey IT, which respond in a viewpoint-sensitive manner. Recent studies have demonstrated that object identity can be “read out” from ensembles of such neurons even though there are almost no individual neurons that respond to object identity under all possible transformations of lighting, position, and viewpoint (DiCarlo and Cox, 2007; Hung et al., 2005; Tsao et al., 2006). Furthermore, the same set of neurons can provide information about both object category and object identity. Similarly, ensembles of PPA neurons might be readable by other cortical regions in terms of both the category and the identity of the scene being viewed, even if individual neurons are tuned to specific views of specific scenes. So, what do these MVPA and fMRI adaptation results tell us about the function of the PPA? The MVPA data of Walther and colleagues suggest that the PPA encodes information that allows scene categories to be distinguished. However, as noted above, these results do not necessarily indicate that PPA neurons are tuned for category (in the sense that there are “kitchen neurons”). Indeed, to our knowledge, there have been no reports of category-specific adaptation (kitchen A primes kitchen B) in the PPA, and some recent unpublished data from our laboratory have failed to find such an effect (MacEvoy and Epstein, 2009a; Morgan et al., 2009). Rather, fMRI adaptation data indicate that the PPA adapts to repetition of a specific scene (i.e., two images of my kitchen
revealed orientation-tuned adaptation that decreased gradually as the angular difference between the adapting and the adapted face increased, in a manner consistent with the orientation-selective tuning curves observed in neurophysiological data. The shortinterval adaptation paradigm, on the other hand, revealed adaptation effects that were much more orientation-specific, insofar as adaptation was observed only when the adapting and adapted faces were shown from identical viewpoints. These data are similar to our own insofar as they indicate that measurements of short-interval fMRI adaptation effects may overestimate the degree of stimulus specificity within a region.
R. A. Epstein and S. P. MacEvoy rather than two images depicting two different kitchens), with incomplete tolerance for viewpoint changes (and a large degree of tolerance for retinal-position changes (MacEvoy and Epstein, 2007)). Thus, our best guess is that PPA neurons encode visual or spatial quantities that vary somewhat with viewpoint and differ strongly between individual scenes of the same category. Furthermore, the neurons responding to these quantities are clustered unequally within the region such that different scene categories elicit different voxelwise activation patterns. However, the precise nature of these quantities – whether they are geometric shape parameters, 2D visual features, or spatial relationships – is unknown. Indeed, the question is almost entirely unexplored. 12.5
Integrating objects into the scene
The literature reviewed thus far clearly implicates the PPA in scene recognition. We have argued that this recognition probably involves analysis of whole-scene quantities such as spatial layout. We are skeptical of the idea that the PPA encodes the individual objects within the scene, although it is worthwhile to note that recent MVPA studies indicate that PPA activation patterns can provide information about individual objects when they are presented in isolation (Diana et al., 2008). However, there are clearly circumstances in which information about object identity can be an important cue for recognizing a scene. Indeed, part of our ability to quickly understand the “gist” of a scene must involve integration of information about object identities; for example, understanding that a computer, a desk, a whiteboard, a lamp, and some chairs make an office. How might this object information be integrated together into a scene? We addressed this question in a recent study (MacEvoy and Epstein, 2009b) in which we examined the fMRI response to multiple-item object arrays. Although these stimuli were not “scenes”– they contained no background elements and had no three-dimensional layout – they allowed us to explore the rules by which the visual system combines object representations when more than one object appears on the screen. Subjects were scanned while viewing blocks containing either single objects (chairs, shoes, brushes, or cars) or two-object arrays (with each object from a different category). To ensure that the subjects attended equally to both of the objects in the pairs, we asked them to detect stimulus repetitions that could occur randomly at either object location. We then used MVPA to decode the response to both single-object categories and object-category pairs. Our analyses focused on the LOC, which is the area of the brain that appears to be critically involved in processing object identity. Recent studies indicate that LOC response patterns can be used to decode scene
Making a scene in the brain categories (Walther et al., 2009) and also objects within scenes (Peelen et al., 2009). We found that the multivoxel response patterns in the LOC contained information not only about single objects (as demonstrated previously by Haxby and colleagues) but also about object pairs. In other words, one can use the distributed pattern of the fMRI response to tell not only whether the subject is looking at a shoe or a brush but also whether he or she is looking at a shoe and a brush together. Moreover, we were able to reliably distinguish between patterns evoked by object pairs that shared an object, such as shoe + brush, shoe + car, and shoe + chair. Thus the pattern evoked by each object array was uniquely prescribed by its particular combination of objects. This “uniqueness” comes from the particular way in which patterns evoked by pairs are constructed in the LOC. The patterns evoked by pairs were not random; rather, they obeyed an ordered relationship to the patterns evoked by each of their component objects when those objects were presented alone. Specifically, pair patterns were very well predicted by the arithmetic mean of the patterns evoked by each of the component objects (Figure 12.4). This relationship was strong enough that we were able to decode with 75% accuracy (chance = 50%, p < 10−5 ) the patterns evoked by pairs using synthetic pair patterns created by averaging pairs of single-object patterns. In this way, pair patterns not only are unique with respect to each other, but also ensure that the identity of each of their component objects is preserved. We hypothesized that this averaging rule may be a general solution to the problem of preserving the representations of multiple simultaneous objects in a population of broadly tuned neurons. Because averaging is a linear operation (i.e., summation followed by uniform scaling), simple deconvolution can recover the patterns evoked by each object in an array. Although simple summation without scaling would theoretically achieve the same ends, the scaling step that accompanies averaging preserves linearity under the practical constraint imposed by the limited response ranges of individual neurons. It is no surprise, then, that the same rule is encountered repeatedly throughout the visual hierarchy (Desimone and Duncan, 1995; MacEvoy et al., 2009; Zoccolan et al., 2005). But rather than as a result of interstimulus “competition” for representational resources, as some have previously suggested (Kastner et al., 1998), we see this rule as the outcome of a “cooperative” scheme aimed at preserving as much information as possible about each stimulus in an array. This result gives us a novel framework in which to understand several perceptual phenomena that affect multiobject arrays such as scenes. For instance, consider change blindness (Rensink et al., 1997; Simons and Rensink, 2005). This striking phenomenon is observed when two versions of a scene are alternated
[Figure 12.4, graphical content: (a) scatterplot for one exemplary LOC voxel of responses to pairs (p.s.c.) against the sum of the single-object responses (p.s.c.), with R² = 0.96 and slope = 0.53; (b) regression slope as a function of pair-classification accuracy rank across all LOC voxels.]
Figure 12.4 Relationship between single-object and paired-object responses in the lateral occipital complex. (a) Data from a single exemplary voxel, illustrating the averaging rule (p.s.c., percentage signal change). The response to each of the six object pairs is plotted against the sum of the responses to each of the constituent objects when these objects are presented alone. The data are well fitted by a straight line with a slope of 0.53, indicating that the pair response is approximately the average of the two single-object responses. (b) Data from across the LOC. We predicted that the linear relationship observed in (a) would be most evident in voxels from subregions of the LOC that were most informative about pair identity. We used a “searchlight” analysis to identify such subregions. For each voxel in the LOC, a 5 mm spherical neighborhood (or “searchlight”) was defined, within which two quantities were calculated: (i) the pair-classification performance based on the voxelwise pattern within the sphere, and (ii) the mean slope of the regression line relating the pair and single-object responses, as in (a). The graph shows the slope plotted as a function of pair-classification performance, ranked from the worse-performing to the best-performing voxels. The slope approaches 0.5 for voxels located within the most informative subregions. A color version of this figure can be found on the publisher’s website (www.cambridge.org/9781107001756).
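As a sketch of this analysis logic (not the actual pipeline of MacEvoy and Epstein, 2009b), the following code simulates single-object voxel patterns, constructs pair patterns as their noisy average, recovers the slope of roughly 0.5 relating pair responses to the sum of the single-object responses (cf. Figure 12.4a), and classifies pair patterns against synthetic averaged patterns. All array sizes and noise levels are illustrative assumptions.

import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
objects = ["chair", "shoe", "brush", "car"]
n_voxels = 300
single = {o: rng.normal(1.0, 0.5, n_voxels) for o in objects}   # single-object patterns

def pair_pattern(a, b, noise=0.1):
    """Observed pair pattern: average of the two single-object patterns plus noise."""
    return 0.5 * (single[a] + single[b]) + noise * rng.normal(size=n_voxels)

pairs = list(combinations(objects, 2))
observed = {p: pair_pattern(*p) for p in pairs}

# Regression of pair responses on the sum of the single-object responses (cf. Figure 12.4a).
x = np.concatenate([single[a] + single[b] for a, b in pairs])
y = np.concatenate([observed[p] for p in pairs])
slope = np.polyfit(x, y, 1)[0]
print(f"slope ~ {slope:.2f}")        # close to 0.5, the averaging prediction

# Decode pair identity by matching each observed pattern to a synthetic averaged pattern.
synthetic = {p: 0.5 * (single[p[0]] + single[p[1]]) for p in pairs}
def decode(pattern):
    return max(synthetic, key=lambda p: np.corrcoef(pattern, synthetic[p])[0, 1])
accuracy = np.mean([decode(observed[p]) == p for p in pairs])
print(f"pair decoding accuracy: {accuracy:.2f}")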
on a screen. Even though the two versions of the scene might have quite substantial visual differences, subjects will take a surprisingly long time to notice these differences if low-level visual transients are camouflaged, for example by an intervening blank screen. The subjective impression is that the scene is “the same” every time – at least until the area of difference is specifically noted as a distinct and changing scene element. We hypothesize that this phenomenon occurs for two reasons. First, the scene representation in the PPA takes little note of individual elements; hence, as far as the PPA is concerned, the scene
Making a scene in the brain really is “the same” every time. Second, because the representation of the scene in the LOC is the average of the representations of the individual objects, changing any single object will have a negligible effect on the overall activity pattern. In much the same way, perceptual crowding can be seen as the outcome of the visual system’s attempt to preserve information about each element of an object array. At a certain array size, however, the averaging strategy that works well for small numbers of objects produces a pattern of population activity that is indistinguishable from noise, and the identity of individual elements can no longer be discerned. In both change blindness and crowding, veridical perception can be rescued by attention, which in this framework is a mechanism evolved to combat the signal-to-noise penalty caused by using response averaging (albeit with its own price of vastly reduced sensitivity to unattended objects). Returning to the issue of rapid scene recognition, if we consider the simple object arrays that we used in our experiment to be the very simplest forms of scenes (even if the PPA does not see them that way), then in our results we perhaps have a potential physiological correlate of gist: a pattern that, in preserving the identity of each object in a multiobject scene, forms a unique signature of that scene. Two other notable aspects of our results resonate with gist. First, our results were derived under task conditions designed to ensure that attention was distributed evenly between both objects in each pair, mimicking the attentional state of subjects viewing a briefly presented scene in the original gist experiments. Second, in a way that was consistent with the schematic nature of scene gist, the patterns evoked by pairs did not contain information about the relative positions of the objects (each object pair was presented in two spatial configurations, e.g., shoe above brush, and brush above shoe.) Although MVPA could easily distinguish between patterns evoked by different positions of single objects, demonstrating information about absolute stimulus position, our attempts to decode the spatial configurations of pairs yielded chance performance. This finding is consistent with behavioral data indicating that, in the absence of focal attention, it is possible to extract “gist” information relating to object identity from scenes without determining the locations of the individual objects (Evans and Treisman, 2005). Somewhat surprisingly, then, our neuroimaging results may also allow us to say something more about what the sensation of gist actually is. In particular, if the pattern evoked by a multiple-object scene is linearly related to the patterns evoked by its constituent objects, then gist perception might correspond to an initial hypothesis about the set of objects contributing to this pattern (but not necessarily the recognition of each or any one of those objects) and a judgment about the category of scene that is most likely to contain these objects. In other words, gist perception does not need to follow object recognition, but could be
R. A. Epstein and S. P. MacEvoy a parallel assessment of the pattern evoked by multiple simultaneous objects, a pattern which independently feeds object recognition. This assessment of the object-related pattern within the LOC might also proceed in parallel with an assessment of scene layout in the PPA, with both analyses providing information about scene category.
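A back-of-the-envelope calculation illustrates the change-blindness account sketched above: if a scene pattern is the mean of N object patterns, swapping one object moves the scene-level pattern by only 1/N of the distance between the old and new object patterns. The numbers below are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
n_voxels, n_objects = 300, 8
objs = rng.normal(size=(n_objects, n_voxels))     # patterns for the objects in a scene
scene_a = objs.mean(axis=0)
objs_changed = objs.copy()
objs_changed[0] = rng.normal(size=n_voxels)       # replace a single object
scene_b = objs_changed.mean(axis=0)

shift = np.linalg.norm(scene_b - scene_a)
object_shift = np.linalg.norm(objs_changed[0] - objs[0])
print(shift / object_shift)                       # ~ 1/8: the averaged pattern barely moves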
12.6 Conclusions
The basic fact that human observers can rapidly and accurately identify complex visual scenes has been known for over 30 years. Despite this, the study of the cognitive and neural mechanisms underlying visual scene perception is just beginning. Behavioral work strongly supports the idea that scenes are recognized in part through analysis of whole-scene properties such as layout; complementarily to this, neuropsychological and neuroimaging data point to specific brain regions such as the parahippocampal place area in the mediation of these analyses. Gist perception might also involve rapid analysis of the objects within the scene, perhaps through extraction of a summary or mean signal processed by the lateral occipital and fusiform regions. We expect that our understanding of the neural basis of scene recognition will advance rapidly in the next few years through the deployment of advanced fMRI data analysis techniques such as MVPA and fMRI adaptation, which can be used to isolate the representations that underlie these abilities. Ultimately, we believe that it will be possible to develop a theory of scene recognition that ties together multiple explanatory levels, from the underlying single-neuron physiology, through systems neuroscience, to cognitive theory and behavioral phenomena. Acknowledgments We thank Emily Ward for comments on the manuscript and assistance with figures. We also thank the many colleagues who contributed to the work described here. This work was supported by the National Institutes of Health (EY-016464 to RE) and the National Science Foundation (SBE-0541957). References Abbott, L. F., Varela, J. A., Sen, K., and Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275: 220–224. Aguirre, G. K. and D’Esposito, M. (1999). Topographical disorientation: a synthesis and taxonomy. Brain, 122: 1613–1628.
Making a scene in the brain Aguirre, G. K., Zarahn, E., and D’Esposito, M. (1998). An area within human ventral cortex sensitive to “building” stimuli: evidence and implications. Neuron, 21: 373–383. Antes, J. R., Penland, J. G., and Metzger, R. L. (1981). Processing global information in briefly presented pictures. Psychol. Res., 43: 277–292. Bar, M. (2004). Visual objects in context. Nature Rev. Neurosci., 5: 617–629. Biederman, I. (1972). Perceiving real-world scenes. Science, 177: 77–80. Biederman, I., Rabinowitz, J. C., Glass, A. L., and Stacy, E. W., Jr. (1974). On the information extracted from a glance at a scene. J. Exp. Psychol., 103: 597–600. Blakemore, C. and Nachmias, J. (1971). The orientation specificity of two visual after-effects. J. Physiol., 213: 157–174. Boynton, G. M. and Finney, E. M. (2003). Orientation-specific adaptation in human visual cortex. J. Neurosci., 23: 8781–8787. Buckner, R. L., Goodman, J., Burock, M., Rotte, M., Koutstaal, W., Schacter, D., Rosen, B. R., and Dale, A. M. (1998). Functional-anatomic correlates of object priming in humans revealed by rapid presentation event-related fMRI. Neuron, 20: 285–296. Cant, J. S. and Goodale, M. A. (2007). Attention to form or surface properties modulates different regions of human occipitotemporal cortex. Cereb. Cortex, 17: 713–731. Cheng, K. (1986). A purely geometric module in the rats spatial representation. Cognition, 23: 149–178. Cheng, K. and Newcombe, N. S. (2005). Is there a geometric module for spatial orientation? Squaring theory and evidence. Psychon. Bull. Rev., 12: 1–23. Desimone, R. and Duncan, J. (1995). Neural mechanisms of selective visual attention. Annu. Rev. Neurosci., 18: 193–222. Diana, R. A., Yonelinas, A. P. and Ranganath, C. (2008). High-resolution multi-voxel pattern analysis of category selectivity in the medial temporal lobes. Hippocampus, 18: 536–541. DiCarlo, J. J. and Cox, D. D. (2007). Untangling invariant object recognition. Trends Cogn. Sci., 11: 333–341. Drucker, D. M. and Aguirre, G. K. (2009). Different spatial scales of shape similarity representation in lateral and ventral LOC. Cereb. Cortex, 19(10): 2269–2280. Epstein, R. A. (2008). Parahippocampal and retrosplenial contributions to human spatial navigation. Trends Cogn. Sci., 12: 388–396. Epstein, R. A. and Higgins, J. S. (2007). Differential parahippocampal and retrosplenial involvement in three types of visual scene recognition. Cereb. Cortex, 17: 1680–1693. Epstein, R. and Kanwisher, N. (1998). A cortical representation of the local visual environment. Nature, 392: 598–601. Epstein, R., Harris, A., Stanley, D., and Kanwisher, N. (1999). The parahippocampal place area: recognition, navigation, or encoding? Neuron, 23: 115–125.
R. A. Epstein and S. P. MacEvoy Epstein, R., DeYoe, E. A., Press, D. Z., Rosen, A. C., and Kanwisher, N. (2001). Neuropsychological evidence for a topographical learning mechanism in parahippocampal cortex. Cogn. Neuropsychol., 18: 481–508. Epstein, R., Graham, K. S., and Downing, P. E. (2003). Viewpoint-specific scene representations in human parahippocampal cortex. Neuron, 37: 865–876. Epstein, R. A., Higgins, J. S., and Thompson-Schill, S. L. (2005). Learning places from views: variation in scene processing as a function of experience and navigational ability. J. Cogn. Neurosci., 17: 73–83. Epstein, R. A., Higgins, J. S., Jablonski, K., and Feiler, A. M. (2007a). Visual scene processing in familiar and unfamiliar environments. J. Neurophysiol., 97: 3670–3683. Epstein, R. A., Parker, W. E., and Feiler, A. M. (2007b). Where am I now? Distinct roles for parahippocampal and retrosplenial cortices in place recognition. J. Neurosci., 27: 6141–6149. Epstein, R. A., Parker, W. E., and Feiler, A. M. (2008). Two kinds of FMRI repetition suppression? Evidence for dissociable neural mechanisms. J. Neurophysiol., 99: 2877–2886. Evans, K. K. and Treisman, A. (2005). Perception of objects in natural scenes: is it really attention free? J. Exp. Psychol. Hum. Percept. Perform., 31: 1476–1492. Fang, F., Murray, S. O., Kersten, D., and He, S. (2005). Orientation-tuned FMRI adaptation in human visual cortex. J. Neurophysiol., 94: 4188–4195. Fang, F., Murray, S. O., and He, S. (2007). Duration-dependent FMRI adaptation and distributed viewer-centered face representation in human visual cortex. Cereb. Cortex, 17: 1402–1411. Fei-Fei, L., Iyer, A., Koch, C., and Perona, P. (2007). What do we perceive in a glance of a real-world scene? J. Vis., 7: 10. Gallistel, C. R. (1990). The Organization of Learning. Cambridge, MA: MIT Press. Gallistel, C. R. and King, A. P. (2009). Memory and the Computational Brain: Why Cognitive Science Will Transform Neuroscience. Chichester, UK; Malden, MA: Wiley-Blackwell. Ganel, T., Gonzalez, C. L., Valyear, K. F., Culham, J. C., Goodale, M. A., and Kohler, S. (2006). The relationship between fMRI adaptation and repetition priming. Neuroimage, 32: 1432–1440. Gonsalves, B. D., Kahn, I., Curran, T., Norman, K. A., and Wagner, A. D. (2005). Memory strength and repetition suppression: multimodal imaging of medial temporal cortical contributions to recognition. Neuron, 47: 751–761. Greene, M. R. and Oliva, A. (2009). Recognition of natural scenes from global properties: seeing the forest without representing the trees. Cogn. Psychol., 58: 137–176. Grill-Spector, K. and Malach, R. (2001). fMR-adaptation: a tool for studying the functional properties of human cortical neurons. Acta Psychol., 107: 293–321. Grill-Spector, K., Henson, R., and Martin, A. (2006). Repetition and the brain: neural models of stimulus-specific effects. Trends Cogn. Sci., 10: 14–23. Habib, M. and Sirigu, A. (1987). Pure topographical disorientation – a definition and anatomical basis. Cortex, 23: 73–85.
Making a scene in the brain Haxby, J. V., Gobbini, M. I., Furey, M. L., Ishai, A., Schouten, J. L., and Pietrini, P. (2001). Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science, 293: 2425–2430. Hécaen, H., Tzortzis, C., and Rondot, P. (1980). Loss of topographic memory with learning deficits. Cortex, 16: 525–542. Henson, R. N. (2003). Neuroimaging studies of priming. Prog. Neurobiol., 70: 53–81. Hermer, L. and Spelke, E. S. (1994). A geometric process for spatial reorientation in young children. Nature, 370: 57–59. Hubel, D. H. and Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J. Physiol., 160: 106–154. Hung, C. P., Kreiman, G., Poggio, T., and DiCarlo, J. J. (2005). Fast readout of object identity from macaque inferior temporal cortex. Science, 310: 863–866. Intraub, H. and Dickinson, C. A. (2008). False memory 1/20th of a second later: what the early onset of boundary extension reveals about perception. Psychol. Sci., 19: 1007–1014. Intraub, H., Bender, R. S., and Mangels, J. A. (1992). Looking at pictures but remembering scenes. J. Exp. Psychol.: Learn. Mem. Cogn., 18: 180–191. Ishai, A., Ungerleider, L. G., Martin, A., Schouten, J. L., and Haxby, J. V. (1999). Distributed representation of objects in the human ventral visual pathway. Proc. Natl. Acad. Sci. USA, 96: 9379–9384. Kamitani, Y. and Tong, F. (2005). Decoding the visual and subjective contents of the human brain. Nature Neurosci., 8: 679–685. Kastner, S., De Weerd, P., Desimone, R., and Ungerleider, L. C. (1998). Mechanisms of directed attention in the human extrastriate cortex as revealed by functional MRI. Science, 282: 108–111. Kim, M., Ducros, M., Carlson, T., Ronen, I., He, S., Ugurbil, K., and Kim, D.-S. (2006). Anatomical correlates of the functional organization in the human occipitotemporal cortex. Magn. Reson. Imaging, 24: 583–590. Kourtzi, Z. and Kanwisher, N. (2001). Representation of perceived object shape by the human lateral occipital complex. Science, 293: 1506–1509. Maccotta, L. and Buckner, R. L. (2004). Evidence for neural effects of repetition that directly correlate with behavioral priming. J. Cogn. Neurosci., 16: 1625–1632. MacEvoy, S. P. and Epstein, R. A. (2007). Position selectivity in scene- and object-responsive occipitotemporal regions. J. Neurophysiol., 98: 2089–2098. MacEvoy, S. P. and Epstein, R. A. (2009a). Building scenes from objects: a distributed pattern perspective. In Neuroscience Meeting Planner, Program No. 262.29. Chicago, IL: Society for Neuroscience. Online. MacEvoy, S. P. and Epstein, R. A. (2009b). Decoding the representation of multiple simultaneous objects in human occipitotemporal cortex. Curr. Biol., 19: 943–947. MacEvoy, S. P., Tucker, T. R., and Fitzpatrick, D. (2009). A precise form of divisive suppression supports population coding in the primary visual cortex. Nature Neurosci., 12: 637–645. Malach, R., Reppas, J. B., Benson, R. R., Kwong, K. K., Jiang, H., Kennedy, W. A., Ledden, P. J., Brady, T. J., Rosen, B. R., and Tootell, R. B. (1995). Object-related
R. A. Epstein and S. P. MacEvoy activity revealed by functional magnetic resonance imaging in human occipital vortex. Proc. Natl. Acad. Sci. USA, 92: 8135–8139. Marr, D. (1982). Vision. New York: W. H. Freeman. Mendez, M. F. and Cherrier, M. M. (2003). Agnosia for scenes in topographagnosia. Neuropsychologia, 41: 1387–1395. Miller, E. K., Li, L., and Desimone, R. (1993). Activity of neurons in anterior inferior temporal cortex during a short-term-memory task. J. Neurosci., 13: 1460–1478. Morgan, L. K., MacEvoy, S. P., Aguirre, G. K., and Epstein, R. A. (2009). Decoding scene categories and individual landmarks from cortical response patterns. In 2009 Neuroscience Meeting Planner, Program No. 262.8. Chicago, IL: Society for Neuroscience. Online. Muller, J. R., Metha, A. B., Krauskopf, J., and Lennie, P. (1999). Rapid adaptation in visual cortex to the structure of images. Science, 285: 1405–1408. Pallis, C. A. (1955). Impaired identification of faces and places with agnosia for colours – report of a case due to cerebral embolism. J. Neurol. Neurosurg. Psychiatry, 18: 218–224. Park, S. and Chun, M. M. (2009). Different roles of the parahippocampal place area (PPA) and retrosplenial cortex (RSC) in panoramic scene perception. Neuroimage, 47: 1747–1756. Peelen, M. V., Fei-Fei, L., and Kastner, S. (2009). Neural mechanisms of rapid natural scene categorization in human visual cortex. Nature, 460: 94–97. Potter, M. C. (1975). Meaning in visual search. Science, 187: 965–966. Potter, M. C. (1976). Short-term conceptual memory for pictures. J. Exp. Psychol.: Hum. Learn. Mem., 2: 509–522. Potter, M. C. and Levy, E. I. (1969). Recognition memory for a rapid sequence of pictures. J. Exp. Psychol. 81: 10–15. Renninger, L. W. and Malik, J. (2004). When is scene identification just texture recognition? Vis. Res., 44: 2301–2311. Rensink, R. A., O’Regan, J. K., and Clark, J. J. (1997). To see or not to see: the need for attention to perceive changes in scenes. Psychol. Sci., 8: 368–373. Sasaki, Y., Rajimehr, R., Kim, B. W., Ekstrom, L. B., Vanduffel, W., and Tootell, R. B. (2006). The radial bias: a different slant on visual orientation sensitivity in human and nonhuman primates. Neuron, 51: 661–670. Sawamura, H., Orban, G. A., and Vogels, R. (2006). Selectivity of neuronal adaptation does not match response selectivity: a single-cell study of the FMRI adaptation paradigm. Neuron, 49: 307–318. Schyns, P. G. and Oliva, A. (1994). From blobs to boundary edges: evidence for timeand spatial-scale-dependent scene recognition. Psychol. Sci., 5: 195–200. Shelton, A. L. and Pippitt, H. A. (2007). Fixed versus dynamic orientations in environmental learning from ground-level and aerial perspectives. Psychol. Res., 71: 333–346. Simons, D. J. and Rensink, R. A. (2005). Change blindness: past, present, and future. Trends Cogn. Sci., 9: 16–20. Steeves, J. K., Humphrey, G. K., Culham, J. C., Menon, R. S., Milner, A. D., and Goodale, M. A. (2004). Behavioral and neuroimaging evidence for a
Making a scene in the brain contribution of color and texture information to scene classification in a patient with visual form agnosia. J. Cogn. Neurosci., 16: 955–965. Tanaka, K. (1993). Neuronal mechanisms of object recognition. Science, 262: 685–688. Thorpe, S., Fize, D., and Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381: 520–522. Tsao, D. Y. and Livingstone, M. S. (2008). Mechanisms of face perception. Annu. Rev. Neurosci., 31: 411–437. Tsao, D. Y., Freiwald, W. A., Tootell, R. B., and Livingstone, M. S. (2006). A cortical region consisting entirely of face-selective cells. Science, 311: 670–674. Vuilleumier, P., Henson, R. N., Driver, J., and Dolan, R. J. (2002). Multiple levels of visual object constancy revealed by event-related fMRI of repetition priming. Nature Neurosci., 5: 491–499. Walther, D. B., Caddigan, E., Fei-Fei, L., and Beck, D. M. (2009). Natural scene categories revealed in distributed patterns of activity in the human brain. J. Neurosci., 29: 10573–10581. Wig, G. S., Grafton, S. T., Demos, K. E., and Kelley, W. M. (2005). Reductions in neural activity underlie behavioral components of repetition priming. Nature Neurosci., 8: 1228–1233. Wiggs, C. L. and Martin, A. (1998). Properties and mechanisms of perceptual priming. Curr. Opin. Neurobiol., 8: 227–233. Yamane, Y., Carlson, E. T., Bowman, K. C., Wang, Z., and Connor, C. E. (2008). A neural code for three-dimensional object shape in macaque inferotemporal cortex. Nature Neurosci., 11(11), 1352–1360. Zoccolan, D., Cox, D. D., and DiCarlo, J. J. (2005). Multiple object response normalization in monkey inferotemporal cortex. J. Neurosci., 25: 8150–8164.
13 Surface color perception and light field estimation in 3D scenes
Laurence T. Maloney, Holly E. Gerhard, Huseyin Boyaci, and Katja Doerschner
13.1 The light field
The spectral power distribution of the light emitted by the Sun is almost constant. The variations in daylight (Figure 13.1) that we experience over the course of a day and with changes in seasons are due to the interaction of sunlight with the Earth’s atmosphere (Henderson, 1977). The resulting spectral distribution of daylight across the sky is typically spatially inhomogeneous and constantly changing (Lee and Hernández-Andrés, 2005a,b). The light arriving at each small patch of surface in the scene depends in general on the patch’s location and orientation in the scene. Furthermore, objects in the scene create shadows or act as secondary light sources, adding further complexity to the light field (Gershun, 1936/1939; see also Adelson and Bergen, 1991) that describes the spectral power distribution of the light arriving from every direction at every point in the scene. The light field captures what a radiospectrophotometer placed at each point in the scene, pointing in all possible directions, would record (Figure 13.2). When the light field is inhomogeneous, the light absorbed and reradiated by a matte1 smooth surface patch can vary markedly with the orientation or location of the patch in the scene. In Figure 13.3, for example, we illustrate the wide range of the light emitted by identical rectangular achromatic matte surfaces at many orientations,
1
We use the term “matte” as a synonym for “Lambertian,” a mathematical idealization of a matte surface (Haralick and Shapiro, 1993).
Figure 13.1 Terrestrial daylight. Four views of the sky over Los Angeles. Courtesy of Paul Debevec. A color version of this figure can be found on the publisher’s website (www.cambridge.org/9781107001756).
Figure 13.2 The light field. The light field is the spectral power distribution of light arriving from every possible direction at every point in the scene. For one wavelength and one location, the light field can be represented as a sphere as shown. Courtesy of Paul Debevec.
Figure 13.3 The effect of orientation. Identical rectangular matte patches at different locations and orientations, rendered under a distant neutral, collimated source placed along the line of sight. The luminance of each patch is proportional to the cosine of the angle between the surface normal and the direction towards the collimated source (Lambert’s law) (Haralick and Shapiro, 1993).
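For readers who prefer a computational statement of this lighting model, the sketch below computes the luminance of a Lambertian patch under a combination of a diffuse source and a distant collimated ("punctate") source. The intensities, directions, and albedo are illustrative assumptions, not the values used in the rendered stimuli.

import numpy as np

def unit(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def lambertian_luminance(albedo, normal, light_dir, E_collimated=1.0, E_diffuse=0.3):
    """Light reflected by a matte patch: diffuse term plus Lambert's cosine term."""
    cos_theta = max(0.0, float(np.dot(unit(normal), unit(light_dir))))
    return albedo * (E_diffuse + E_collimated * cos_theta)

light_dir = unit([0.0, 0.5, 1.0])            # direction towards the collimated source
for elevation in (0, 30, 60, 90):            # rotate the patch normal away from the source
    a = np.deg2rad(elevation)
    normal = [0.0, np.sin(a), np.cos(a)]
    print(elevation, round(lambertian_luminance(0.5, normal, light_dir), 3))

The luminance falls off with the cosine of the angle between the patch normal and the source direction, which is exactly the dependence the visual system must discount to recover a stable albedo.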
illuminated by a distant neutral, collimated light source. Any visual system designed to estimate surface color properties (including lightness) is confronted with a new problem to solve with each change of surface orientation or location. To arrive at a stable estimate of surface reflectance, the visual system has to discount the effects of the light field on a patch. In many scenes, discounting spatial variation in the light field is an underconstrained problem (Adelson and Pentland, 1996; Belhumeur et al., 1999; Dror et al., 2004). Moreover, detecting changes in the current light field and distinguishing them from changes in objects within the scene is itself a potentially difficult problem for the visual system. In this review, we describe recent research concerning surface color estimation, light field estimation, and discrimination of changes in the light field from other changes in the scene.
13.1.1 The Mondrian singularity
Previous research in color vision has typically avoided the problems introduced by spatial and spectral variation in the light field through its choice of scenes. These scenes, consisting of flat, coplanar matte surfaces,
Surface color perception and light field estimation in 3D scenes have been referred to as Mondrians (Land and McCann, 1971). In such scenes, observers can accurately make a variety of judgments concerning surface color and lightness. Arend and Spehar (1993a,b) showed that observers were close to constant in estimating the lightness of a matte surface embedded in a twodimensional Mondrian despite changes in illumination. Foster and Nascimento (1994) and Nascimento and Foster (2000) showed that observers can reliably distinguish whether the change in appearance of a Mondrian is due to an “illumination” change or a reflectance change, and that this could be explained by a model based on cone-excitation ratios. Bäuml (1999) showed that observers are capable of constant estimation of the color of surfaces in Mondrians following changes in illumination, and that his results could be well accounted by using the von Kries principle, which is a simple linear transformation of cone responses. However, these studies need to be extended for two reasons. First, there is no obvious way to generalize these results to the normal viewing conditions of our ever-changing, three-dimensional world. In the flat, two-dimensional world of Mondrians, no matter how complex the light field is, the light emitted from a surface contains essentially no information about the spatio-spectral distribution of the light incident upon the surface (Maloney, 1999). A matte surface absorbs light from all directions in a hemisphere centered on its surface normal and then reemits uniformly in all directions a fraction of the total light absorbed: a matte surface “forgets” where the light came from. In previous work, we have called this phenomenon the Mondrian singularity (Boyaci et al., 2006a). A second reason why it is important to consider a wider range of stimuli in evaluating human color perception is that three-dimensional scenes can convey considerable information about the light field in a scene. Maloney (1999) noted that there are potential “cues to the illuminant” in three-dimensional scenes that signal illuminant chromaticity. Here we consider recent work directed towards determining what cues signal how the intensity and chromaticity of the illumination incident on a matte surface vary with surface orientation and location in three-dimensional scenes. We also consider recent studies of human ability to estimate the light field and to discriminate changes in the light field from changes in the actual contents of a scene, including the surface colors of objects in the scene. 13.1.2
Illuminant cues
The perception of surface color in Mondrian scenes is an intrinsically difficult problem. In order to estimate surface color accurately, the visual system must estimate the net intensity and chromaticity of the light incident on the Mondrian. The typical approach taken is to develop simple measures of the
L. T. Maloney, H. E. Gerhard, H. Boyaci, and K. Doerschner central tendency, variance, and covariance of the photoreceptor excitations and use them as a basis for estimating the light intensity and chromaticity. “Gray world” algorithms (for reviews, see Hurlbert (1998) and Maloney (1999)), for example, use the mean chromaticity of the scene as an estimate of the chromaticity of the light. Mausfeld and Andres (2002) have conjectured that means, variances, and covariances contain all of the information used by the visual system in estimating surface color. Golz and Macleod (2002) and Macleod and Golz (2003) concluded that correlations between the chromaticities and luminance values of surfaces contained useful information about the chromaticity of the effectively uniform illumination of a Mondrian scene, but this conclusion has been challenged by recent work (Ciurea and Funt, 2004; Granzier et al., 2005). These measurements, based on simple moments (mean, variance, and covariance) of distributions, eliminate what little spatial structure is present in the Mondrian. It is not clear that simple moments2 derived from Mondrian scenes convey any information about the chromaticity of the illuminant or its intensity (Maloney, 1999), and they convey no information about spatial and spectral inhomogeneities in the light field. When a scene is not restricted to flat, coplanar, matte surfaces arranged in a Mondrian, more information about the chromaticity of the illuminant (Maloney, 1999; Yang and Maloney, 2001) and the spatial and spectral distribution of the light field may be available to the observer. Researchers have shown that human observers are able to judge lower-order estimates of the light field such as diffuseness and mean illumination direction (te Pas and Pont, 2005; Pont and Koenderink 2004). We emphasize that any deviation from “flat” or “matte” in an otherwise Mondrian stimulus could disclose information about the light field, and we refer to these sources of information as illuminant cues (Kaiser and Boynton, 1996; Maloney, 1999; Yang and Maloney, 2001; Pont and Koenderink, 2003, 2004; Koenderink and van Doorn, 1996; Koenderink et al., 2004). llluminant cues, by definition, carry information about the illuminant. It is possible to develop algorithms that estimate the light field from such cues (see, e.g., Hara et al., 2005) and Ramamoorthi and Hanrahan (2001a), but currently such algorithms depend upon restrictive assumptions about the scene and its illumination. Such algorithms are based on the physics of image formation, and, when they succeed, we can be sure they carry the desired information. The relevant question concerning an illuminant cue is whether it is used in human vision.
2
Technically, any function of the retinal image is a statistic, and it is likely that the claim is vacuously true. In practice, researchers confine their attention to the moments of lowest degree of the photoreceptor excitations in the retinal image (e.g., Mausfeld and Andres, 2002), and we use the term “scene statistics” as a synonym for these moments.
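For concreteness, here is a minimal sketch of a gray-world estimate of the illuminant followed by a von Kries (diagonal) correction of cone responses in a flat, Mondrian-style world. The simulated reflectances and illuminant are placeholders, not data from the studies cited above.

import numpy as np

rng = np.random.default_rng(3)
true_illuminant = np.array([1.2, 1.0, 0.7])        # relative L, M, S power of the light
reflectances = rng.uniform(0.1, 0.9, size=(50, 3)) # matte surface reflectances (L, M, S)
cone_responses = reflectances * true_illuminant    # flat, Mondrian-style rendering

# Gray-world step: attribute the mean cone response to the illuminant
# (valid only up to an overall scale, and only if reflectances average to gray).
illuminant_estimate = cone_responses.mean(axis=0)
print(illuminant_estimate / illuminant_estimate[1])   # chromaticity close to 1.2 : 1 : 0.7

# von Kries step: a diagonal rescaling of the cone responses by the estimate.
corrected = cone_responses / illuminant_estimate
print(np.allclose(corrected * reflectances.mean(axis=0), reflectances))   # True here

Note that the estimate rests entirely on the mean chromaticity of the scene; it carries no information about how the illumination varies with surface orientation or location, which is precisely the limitation discussed above.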
Surface color perception and light field estimation in 3D scenes In this chapter, we review recent work from a small number of research groups concerning how biological visual systems extract information about surfaces (albedo and color) and the flow of light in scenes outside the Mondrian singularity. We describe in more detail three sets of experiments, the first testing whether human observers can compensate for changes in surface orientation and examining what illuminant cues they may be using, and the second examining whether human observers can compensate for changes in surface location in scenes with a strong illuminant gradient in depth. In the third set of experiments, we assess human ability to estimate the light field and discriminate changes in the light field from other changes within a scene. The experimental results indicate that human color vision is well equipped to solve these apparently more complex problems in surface color perception outside the Mondrian singularity (for a review, see Maloney et al., 2005).
13.2 Lightness and color perception with changes in orientation
Boyaci et al. (2003) investigated how human observers compensate for changes in surface orientation in binocularly viewed, computer-rendered scenes illuminated by a combination of neutral collimated3 and diffuse light sources. The simulated source was sufficiently far from the rendered scene that it could be treated as a collimated source. The observer’s task was to match the lightness (perceived albedo) of a test surface within the scene to a nearby lightness scale. The orientation of the test patch with respect to the collimated source was varied, and the question of interest was whether observers would compensate for test patch orientation, partially or fully. Previous work had found little or no compensation (Hochberg and Beck, 1954; Epstein, 1961; Flock and Freedberg, 1970; Redding and Lester, 1980) (see Boyaci et al. (2003) for details). The methods and stimuli employed were similar to those of Boyaci et al. (2004), which we present in more detail below. In contrast to previous researchers, Boyaci et al. (2003) found that observers compensated substantially for changes in test patch orientation. Ripamonti et al. (2004) reached the same conclusions using scenes of similar design, composed of actual surfaces (not computer-rendered) viewed under a combination of collimated and diffuse light sources. The conclusion of both studies was that the visual system partially compensates for changes in surface orientation in scenes whose lighting model consists of a combination of a diffuse and a collimated source. 3
For simplicity in rendering, collimated sources were approximated by point sources that were distant from the rendered scene. Elsewhere, we refer to these sources as “punctate.” The difference is only terminological.
L. T. Maloney, H. E. Gerhard, H. Boyaci, and K. Doerschner Boyaci et al. (2004) examined judgments of surface color in a similar experiment. The lighting model consisted of a distant collimated yellow light source (“sun”) and a diffuse blue light source (“sky”). The test surface was illuminated by a mixture of the two that depended on the orientation of the test surface and the lighting model. The observer’s task was to set the test patch to be achromatic (an achromatic-setting task). To do so, the observer first needed to estimate the blue–yellow balance of the light incident on the test patch, which was itself part of the spatial organization of a scene. Next, the observer needed to set the chromaticity of the light emitted by the surface to be consistent with that of an achromatic surface. The collimated light source was simulated to be behind the observer at elevation 30◦ and azimuth −15◦ (on the observer’s left) or 15◦ (on the observer’s right). The location of the light source remained constant during an experimental block. In every trial, each of four naive observers was presented with a rendered scene and asked to adjust a test surface to be achromatic. Scenes were rendered as a stereo image pair and viewed binocularly. A typical scene is shown in Figure 13.4. The test patch was always in the center and at the front
Figure 13.4 A scene from Boyaci et al. (2004). Observers viewed rendered scenes binocularly. The two images permit crossed binocular fusion. The scenes were rendered with a combination of yellow collimated and blue diffuse light sources. The collimated source was always behind the observer, to the observer’s left in half the trials and to the right in the remainder. The orientation of the test surface varied in azimuth and elevation from trial to trial. The observer’s task was to set the test surface in the center of the scene to be achromatic. A color version of this figure can be found on the publisher’s website (www.cambridge.org/9781107001756 ).
Surface color perception and light field estimation in 3D scenes of the scene, and additional objects were provided as possible cues to the light field. From trial to trial, the orientation of the test patch was varied in either azimuth or elevation, but not both. The dependent measure of interest was the relative blue intensity B in the observer’s achromatic setting (for details, see Boyaci et al., 2004). In theory, as the angle between the normal to the test surface and the direction to the yellow collimated light source increases, the observer should make achromatic settings that are “bluer.” Boyaci et al. (2004) derived setting predictions for an ideal observer who chose achromatic settings that were color-constant, always picking the setting consistent with a test surface that was achromatic. These setting predictions are plotted in Figure 13.5a. There are two plots, one for the collimated light on the observer’s left and one for the light on the observer’s right. In each plot, the relative blue intensity B is plotted versus the azimuth of the test surface (solid curve) and versus the elevation (dashed curve). It is important to realize that each curve reaches a minimum when the test patch’s orientation matches the direction of the yellow collimated light source. The results are shown in Figure 13.5b for the subject closest to the ideal observer. All four subjects substantially discounted the effect of changes in orientation. Boyaci et al. (2004) were able to recover crude estimates of the light source direction from each observer’s achromatic settings by estimating and plotting the minima of the four curves (Figure 13.6). There are four estimates of the azimuth (one for each observer) for the light source on the left (Figure 13.6a), and four for the light source on the right (Figure 13.6b). There are eight corresponding estimates of the elevation. The eight estimates of the elevation (Figure 13.6c) are within 10◦ of the true values. The outcome of the experiment of Boyaci et al. (2004), together with the results of Boyaci et al. (2003) and Ripamonti et al. (2004), implies that the observer’s visual system effectively develops an equivalent illumination model (Boyaci et al., 2003) for a scene and uses this model to estimate the albedo and surface color of surfaces at different orientations. In order to do so, the visual system must use the cues present within the scene itself. In a more recent experiment (Boyaci et al., 2006a), we examined three possible “cues to the lighting model” that were present in the scenes described above: cast shadows, surface shading, and specular highlights. We asked the observers to judge the lightness of a rotating central test patch embedded in scenes that contained various cues to the lighting model. The methods and stimuli were similar to those in Boyaci et al. (2003) and the first experiment described above. We compared four conditions: the all-cues-present condition, where all three
[Figure 13.5 panels: (a) the ideal observer, with the collimated ("punctate") source on the left and on the right; (b) observer BH's settings, with discount indices of 0.77 and 0.8. In each panel the relative blue intensity Λ_B is plotted against the test-surface azimuth and elevation (ψ_T, ϕ_T).]
Figure 13.5 Achromatic-setting results from Boyaci et al. (2004). The dependent variable was the amount of blue (blue/total, or relative blue) in the observer’s achromatic setting. (a) The settings for an ideal observer who perfectly compensated for changes in test patch orientation and collimated-source position. The left graph contains a plot of settings for trials where the collimated source was at 30◦ elevation and −15◦ azimuth (above and behind the observer, to his/her left). The right graph contains a plot of settings for trials where the collimated source was at 30◦ elevation and 15◦ azimuth (above and behind the observer, to his/her right). In both graphs, the horizontal axis is used to plot either the azimuth or the elevation of the test surface. The vertical axis is the relative blue intensity in the observer’s settings. The solid curve in each graph signifies the settings of the ideal observer that compensate for changes in the test surface azimuth. The dashed curve in each graph signifies the settings of the ideal observer that compensate for changes in the test surface elevation. Note that both curves reach a minimum when the test surface is closest in azimuth or elevation to the “yellow” collimated source. (b) Settings of one observer, from Boyaci et al. (2004). The format is identical to that of (a). The lines through the data are based on an equivalent illumination model not described here (see Boyaci et al., 2004).
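The shape of the ideal-observer curves in Figure 13.5a follows directly from Lambert's cosine law. The minimal sketch below (Python, with arbitrary illuminant intensities; not the authors' actual model) computes the relative blue intensity of the light reaching a matte test surface as the surface is turned away from a yellow collimated source in the presence of blue diffuse light. A color-constant observer's achromatic settings should track this quantity, reaching a minimum when the surface faces the collimated source.

import numpy as np

def relative_blue(surface_dir_deg, source_dir_deg,
                  yellow_collimated=1.0, blue_diffuse=0.3):
    """Relative blue intensity of the light reaching a matte test surface.

    surface_dir_deg and source_dir_deg are angles (degrees) of the surface
    normal and of the direction to the collimated source, measured in the same
    plane (e.g., azimuth at a fixed elevation).  Intensities are illustrative.
    """
    angle = np.radians(surface_dir_deg - source_dir_deg)
    # Lambert's cosine law for the collimated ("punctate") yellow source;
    # the blue diffuse term does not depend on surface orientation.
    yellow = yellow_collimated * np.maximum(0.0, np.cos(angle))
    blue = blue_diffuse
    return blue / (blue + yellow)

# The curve is lowest when the surface faces the yellow collimated source
# (here at -15 deg azimuth) and rises as the surface turns away from it.
azimuths = np.arange(-60, 61, 15)
print([round(float(relative_blue(a, -15.0)), 3) for a in azimuths])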
Figure 13.6 Estimates of collimated-source direction from Boyaci et al. (2004). For each observer, Boyaci et al. (2004) estimated the minima of the achromatic-setting curves (Figure 13.5) and interpreted them as estimates of the collimated-light-source direction. The true value is plotted as a dashed line, and the observers' estimates as solid lines. (a) Azimuth estimates, collimated source at −15° azimuth. (b) Azimuth estimates, collimated source at 15° azimuth. (c) Elevation estimates.
cues were present in the scene; the cast-shadows-only condition; the shading-only condition; and the specular-highlights-only condition. Boyaci et al. found that observers corrected for the test patch orientation in all four cue conditions, suggesting that they could use each candidate cue in isolation. We also performed a reliability analysis to address the extent to which observers combined the cues when all three were present (the all-cues condition). That analysis indicated that the reliability of the observers' settings in the all-cues condition was higher than for the best individual cue ("effective cue combination"; Oruc et al., 2003); nevertheless, it was smaller than the reliability predicted by optimal cue combination rules. In the next section, we describe two experiments where the orientation of the test surface never changed. Instead, there was a strong gradient of illumination in depth within the scene, and the test surface was moved from a dimly lit region to a brightly lit region. The only difference between the two experiments was the presence of specular surfaces that served as candidate illumination cues. A comparison of performance in the two experiments revealed whether observers used these specular cues to estimate the light field.
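The reliability comparison described above can be made concrete with a small numerical sketch. The numbers below are invented, and the calculation assumes independent, unbiased cues whose reliabilities (reciprocals of setting variances) add under optimal combination; it illustrates the comparison being made, not the authors' analysis.

# Hypothetical setting variances for the three single-cue conditions
# (cast shadows, shading, specular highlights); reliability = 1 / variance.
single_cue_variances = {"cast shadows": 0.040,
                        "shading": 0.025,
                        "highlights": 0.050}
reliabilities = {cue: 1.0 / var for cue, var in single_cue_variances.items()}

best_single = max(reliabilities.values())
optimal_combined = sum(reliabilities.values())   # independent, unbiased cues

# Observed reliability in the all-cues condition (hypothetical number).
observed_all_cues = 55.0

print(f"best single cue:   {best_single:.1f}")
print(f"observed all cues: {observed_all_cues:.1f}")
print(f"optimal (summed):  {optimal_combined:.1f}")
# "Effective cue combination": the observed reliability exceeds the best
# single cue but falls short of the optimal, reliability-summing prediction.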
13.3 Lightness perception with changes in location
In indoor scenes, the light field can vary markedly with location as walls serve to block or reflect incident light. The celebrated experiments of Gilchrist (1977, 1980) (see also Kardos, 1934) demonstrated that the visual system partially compensates for light fields that vary across space. Gilchrist et al.
(1999) proposed that observers segment complex scenes into illumination frameworks (Katz, 1935; Gilchrist and Annan, 2002) and discount the illumination (light field) within each framework. The rules for organizing frameworks and assigning surfaces to them are complex and not fully understood. The three-dimensional structure of a scene could also guide the segmentation of scenes into frameworks, and it is likely that there are analogous effects of three-dimensional organization on color perception (e.g., Bloj et al., 1999; Doerschner et al., 2004). Ikeda et al. (1998) examined lightness perception in scenes comprising two small rooms arranged in depth with a doorway between them, patterned after Gilchrist (1977). The lighting of the rooms was complex, consisting of multiple fluorescent tubes placed above both rooms, and the observer could not see these light sources. The intensity of the light incident on a test surface placed along the line of sight through the doorway varied with depth. Ikeda et al. (1998) measured the apparent lightness for surfaces at different depths. Their observers viewed a test square placed at several different depths along the line of sight and passing in depth through the centers of both rooms. The observers' task was to match the test square to a lightness scale. Ikeda et al. found that observers substantially discounted the actual illumination profiles at different depths in the scene. We next describe two experiments by Snyder et al. (2005) using rendered scenes similar in design to those of Gilchrist (1977) and Ikeda et al. (1998). All scenes were presented binocularly and consisted of two rooms arranged along the line of sight with walls composed of random, achromatic Mondrians. A top view of the simulated scenes is shown in Figure 13.7a. The far room was lit by two light sources not visible to the observer. The near room was lit by diffuse light only. The test surface (called a standard surface from this point onwards) varied in depth from trial to trial as shown. The observer adjusted an adjustable surface in the near room until the standard and adjustable surfaces seemed to "be cut from the same piece of paper." In the second experiment, a candidate cue to the spatial distribution of the illumination was added: 11 specular spheres placed at random in the scene (but never in front of either the standard or the adjustable surface). The relative luminance of the light (with respect to the back wall of the far room) is plotted in Figure 13.7b. It varied by roughly a factor of five from the far room to the near room. An example of a scene with spheres (for Experiment 2) is shown in Figure 13.8. The scenes for Experiment 1 were similar but lacked specular spheres. The results of Snyder et al. (2005) for five subjects, four naive and one (JLS) an author of the study, are shown in Figure 13.9. In both experiments, Snyder et al. estimated the ratio of the luminance of the standard surface to that of the adjustable surface (the relative luminance) at each location in the room. If
Figure 13.7 Schematic illustration of the scenes used by Snyder et al. (2005). (a) Schematic top view of the scenes, showing the observer, the standard surfaces at different depths, the adjustable surface, the doorway, the far wall, and the hidden light sources. (b) The actual relative-illumination profile: the intensity of light incident on a matte surface perpendicular to the observer's line of sight, as a function of relative depth d.
the observers were lightness-constant, these settings would follow the relative-illumination profile in Figure 13.7, which is replotted with the results for each subject in Figure 13.9. The horizontal dashed line corresponds to the settings that would be chosen if the observers were simply matching luminance. The results for Experiment 1 are plotted with hollow circles, and those for Experiment 2 with filled circles.
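The two benchmark predictions follow from the definition of the task. The toy calculation below, with made-up illumination values, shows why a lightness-constant observer's settings track the illumination profile whereas a luminance-matching observer's settings produce a flat line.

# Luminance of a matte surface: L = albedo * E (illumination).
albedo_std = 0.5
E_std, E_adj = 1.0, 0.2      # hypothetical illumination at the standard (far)
                             # and adjustable (near) surfaces
L_std = albedo_std * E_std

# Luminance matching: give the adjustable surface the same luminance.
L_adj_luminance_match = L_std                 # 0.5 -> ratio L_std / L_adj = 1

# Perfect lightness constancy: match perceived albedo instead, so the dimly
# lit adjustable surface needs only one fifth of the luminance here.
L_adj_constant = albedo_std * E_adj           # 0.1 -> ratio L_std / L_adj = 5

print(L_std / L_adj_luminance_match, L_std / L_adj_constant)
# Plotted against depth, the first prediction is flat while the second tracks
# the relative-illumination profile; the observed settings fell in between.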
Figure 13.8 A scene from Snyder et al. (2005). Observers viewed rendered scenes binocularly. The two images permit crossed binocular fusion. The scenes were rendered with a combination of collimated and diffuse light sources. The collimated sources were in the far room behind the wall containing the doorway. The standard patch was in the center of the scene and moved from trial to trial along the observer’s line of sight in depth. The adjustable patch was next to the doorway on the right. The observer adjusted the luminance of this patch until its lightness (perceived albedo) matched the lightness of the test patch.
Snyder et al. concluded that all observers significantly discounted the gradient of illumination in depth in both experiments and that their degree of constancy significantly improved with the addition of the specular spheres.
13.4 Representing the light field
Boyaci et al. (2004) found that observers could compensate for changes in surface orientation in scenes illuminated by a single yellow collimated light and a blue diffuse light. Boyaci et al. (2003) and Ripamonti et al. (2004) came to similar conclusions for asymmetric lightness matches in scenes illuminated by a single collimated source and a diffuse source. These results indicate that the observers’ visual systems effectively estimated a representation of the spatial and chromatic distribution of the illuminant at the points in the scene probed. But what if there is more than one collimated light source? Doerschner et al. (2007) investigated whether the visual system can represent and discount more complex spatial and chromatic light fields. The argument of Doerschner et al. (2007) is based on mathematical results due to Basri and Jacobs (2001) and Ramamoorthi and Hanrahan (2001a,b) that we will not reproduce here. The key idea is that a Lambertian surface effectively blurs the light field incident on it so that, for example, multiple collimated
[Figure 13.9 contains one panel per observer (JLS, RDP, PXV, SMT, and VWC); each panel plots relative luminance (0–1.0) against relative depth d (0–10).]
Figure 13.9 Results from Snyder et al. (2005). The relative luminance of the observers’ lightness matches is plotted as a function of depth, with specular spheres (solid circles) and without (open circles). The actual relative-illumination profile is also included as a solid curve. An observer with perfect lightness constancy would have settings on this line. The horizontal dashed line signifies settings for an observer with no lightness constancy (luminance matching). The observers partially discounted the actual gradient of light intensity, with and without the specular spheres. With the specular spheres, their performance was markedly closer to that of a lightness-constant observer. The results suggest that the spheres act as cues to spatial variations in the light field.
sources that arrive from almost the same direction are equivalent to a single extended source with a single maximum of intensity. Surprisingly, even when the angle between the collimated sources is as great as 90◦ , they effectively merge into a single extended source. When the separation is as great as 160◦ , the effective light field has two distinct maxima. The goal of Doerschner et al. was to determine whether the human visual system could discriminate between these two configurations. The stimuli were computer-rendered 3D scenes, containing a rectangular test patch at the center. Observers viewed the stimuli in a stereoscope. The scenes
Figure 13.10 Example of scene illumination conditions from Doerschner et al. (2007). Only the left image of each stereo pair is shown. The scenes were illuminated by a combination of a diffuse blue source and two yellow collimated sources either 90◦ apart (left) or 160◦ apart (right).
were illuminated by a combination of a diffuse blue source and two yellow collimated sources placed symmetrically about the observer’s line of sight and either 90◦ apart or 160◦ apart. Examples of the stimuli are shown in Figure 13.10. A condition with a blue diffuse source and a single yellow collimated source was included as a control. The orientation of the test patch was randomly varied among nine orientations from −60◦ to 60◦ . In each trial, the observer adjusted the color of the test patch until it was perceived to be achromatic. We analyzed the amount of relative blue in the observers’ achromatic settings as a function of test patch orientation (just as in Boyaci et al., 2004). Six naive observers repeated each orientation-and-light condition 20 times. We fitted a generalization of the equivalent illumination model developed by Boyaci et al. (2003, 2004) (the model of Bloj et al. (2004) is essentially identical to that of Boyaci et al. (2003)) to predict the settings at each test patch orientation for an ideal observer with imperfect knowledge of the spatial and chromatic distribution of the illuminants. The observers systematically discounted the relative contributions of diffuse and collimated light sources at the various test patch orientations for all illuminants. We conclude that the visual system effectively represents complex lighting models that include multiple collimated sources. Doerschner et al. (2007) argued further that the ability of the human visual system to discriminate the presence of multiple light sources is well matched to the task of estimating the surface color and lightness of Lambertian surfaces (Figure 13.11). We will not reproduce their argument here.
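The merging of nearby collimated sources can be illustrated numerically. The sketch below is not the spherical-harmonic analysis of Basri and Jacobs (2001); it simply applies Lambert's cosine law to two equal-intensity collimated sources confined to a plane (ignoring the diffuse term) and asks where the irradiance on a matte facet is greatest as the facet's normal is swept around.

import numpy as np

def irradiance(normals_deg, sources_deg):
    """Irradiance on a matte (Lambertian) facet, by Lambert's cosine law, as a
    function of the facet's normal direction, for equal-intensity collimated
    sources lying in the same plane.  Angles in degrees."""
    n = np.radians(np.asarray(normals_deg, float))[:, None]
    s = np.radians(np.asarray(sources_deg, float))[None, :]
    return np.maximum(0.0, np.cos(n - s)).sum(axis=1)

normals = np.arange(-180.0, 180.0, 0.5)
for separation in (90.0, 160.0):
    E = irradiance(normals, [-separation / 2, separation / 2])
    peaks = normals[np.isclose(E, E.max())]
    dip = E[np.isclose(normals, 0.0)][0] / E.max()
    print(f"{separation:5.0f} deg apart: irradiance maxima at {peaks} deg; "
          f"bisector / peak = {dip:.2f}")
# 90 deg apart  -> a single maximum at the bisector (the sources merge);
# 160 deg apart -> two distinct maxima, one at each source direction.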
[Figure 13.11 contains one panel per observer (IB, AS, MS, and SK); each panel plots Λ_B (0.2–0.5) against the test patch orientation ϕ_T (−60° to 60°).]
Figure 13.11 Data and model fits for the 160° condition from Doerschner et al. (2007). Λ_B is plotted as a function of the test patch orientation ϕ_T. The figure shows observers' data (diamond symbols). The error bars are plus or minus twice the standard error of the mean, which corresponds approximately to the 95% confidence interval. The figure illustrates clearly that the data are fitted better when the observer's equivalent illumination model is approximated with a 9D spherical harmonic subspace (solid line) than with a 4D harmonic subspace (dashed line), indicating that the visual system can resolve directional variation in the illumination up to at least a 9D subspace. All fits were obtained by means of maximum likelihood estimation, as described in Doerschner et al. (2007).
13.5 The psychophysics of the light field
The work discussed above demonstrates that observers discount the effects of the light field when interpreting the color or albedo of a surface, suggesting that observers construct an internal representation of the light field, as described by Boyaci et al.'s (2003) equivalent lighting model. In this section, we review work from two laboratories that have directly studied the psychophysics of the light field. First, we describe the work of Koenderink et al. (2007), who directly evaluated the human visual system's ability to estimate the light field in a novel experiment. Their results suggest that observers accurately estimate a light field filling the entire visual space. Second, we review work by Gerhard et al. (2007) on the temporal dynamics of light field inference, which
Figure 13.12 Light field estimation. Stimuli from Koenderink et al. (2007). Observers set the central sphere to agree with the local light field as inferred from the scene.
revealed that observers detect changes in the light field rapidly and accurately. Last, we show that observers effectively use this estimated light field to improve sensitivity to detecting concomitant changes in surface color. Koenderink et al. (2007) took stereo photographs of real scenes containing matte-white-painted penguins facing each other, standing on a matte gray ground (Figure 13.12). Three lighting conditions were used: one simulating daylight with a distant collimated light source, one simulating an overcast day, and one simulating a “nativity scene” painting, in which the sole light source was at the feet of the group in the middle of the circle. A white matte sphere was also photographed at various positions, either floating or resting on the floor of the scene. During the experiment, observers viewed stereo photographs of the scene where the matte sphere was replaced by a virtual probe sphere with adjustable shading. The probe’s shading started at a random setting, and the observer’s task was to adjust four light field parameters until the sphere appeared to fit well into the scene. There were two position parameters, tilt and slant, and two quality parameters, directedness and intensity. Using four sliders, the observers spent typically one minute adjusting the lighting online until the sphere’s shading looked correct. In order to produce these settings, it was necessary for the observers to infer the properties of the light field using only the penguins and the ground plane as cues to the spatial variance of the light field’s intensity and then to infer how that light field would shade a novel object placed in the scene.
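A probe of this kind can be parameterized compactly. The sketch below shades a matte sphere from four numbers (tilt, slant, directedness, and intensity); the particular mixing rule, the coordinate convention, and the constant diffuse term are assumptions made for illustration and are not taken from Koenderink et al.'s implementation.

import numpy as np

def probe_shading(nx, ny, nz, tilt_deg, slant_deg, directedness, intensity):
    """Shade a matte probe sphere from four light-field parameters.

    (nx, ny, nz) are unit surface normals on the visible hemisphere.  Tilt and
    slant give the mean light direction (slant measured from the viewing axis,
    tilt in the image plane); 'directedness' (0-1) mixes a collimated term with
    a uniform diffuse term; 'intensity' scales the result.  This is a guess at
    a workable probe model, not Koenderink et al.'s actual implementation.
    """
    t, s = np.radians(tilt_deg), np.radians(slant_deg)
    light = np.array([np.cos(t) * np.sin(s), np.sin(t) * np.sin(s), np.cos(s)])
    collimated = np.maximum(0.0, nx * light[0] + ny * light[1] + nz * light[2])
    diffuse = 0.5                                  # orientation-independent term
    return intensity * (directedness * collimated + (1 - directedness) * diffuse)

# Normals of the visible hemisphere of a unit sphere, sampled on an image grid.
y, x = np.mgrid[-1:1:128j, -1:1:128j]
inside = x**2 + y**2 <= 1.0
z = np.sqrt(np.clip(1.0 - x**2 - y**2, 0.0, None))
image = np.zeros_like(x)
image[inside] = probe_shading(x[inside], y[inside], z[inside],
                              tilt_deg=30, slant_deg=60,
                              directedness=0.8, intensity=1.0)
# 'image' now holds one rendering of the probe; adjusting the four parameters
# plays the role of the observer's four sliders.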
Figure 13.13 Stimuli from Gerhard et al. (2007) and Gerhard and Maloney (2008). Observers viewed rendered scenes of achromatic concave and convex pyramids floating in a black space. On the left is a rotated view illustrating the three-dimensional structure of the scene, and on the right is the bird’s-eye view which the observer had through a stereoscope.
The observers' light field settings varied monotonically with the veridical values, with the observers being particularly accurate at setting the light source tilt (within 10° of the veridical tilt). The correlations between the images produced by the observers' settings and the predicted probe images computed from the true lighting settings were quite high, with R² values for the regressions in the range 0.7–0.9. However, the comparison of observers' settings with the veridical light field parameters was not the most important part of the analysis of these results; the observers' settings need not reproduce the true lighting conditions, as the internal representations of the light field might be subject to systematic errors, as other domains of visual processing are (Boyaci et al., 2003; Ripamonti et al., 2004). The important result is that the observers were remarkably consistent, both across sessions and with each other. The light source direction settings were both highly precise and reproducible across sessions. The light quality settings, i.e., directedness and intensity, were similar between observers and fairly reproducible. Koenderink et al.'s (2007) novel method confirmed that human observers can estimate the spatial variation of the light field across the entire visual space. Gerhard et al. (2007) constructed rendered three-dimensional scenes which allowed precise control over image luminances in order to evaluate dynamic light field perception. The scenes were grayscale pyramids floating in a black space, viewed stereoscopically from above. Example stimuli are seen in Figure 13.13.
The sides of the pyramids were various shades of gray and were illuminated according to Lambert's cosine law by a distant collimated source and a diffuse source one-quarter as intense. Rendered trials were created where the collimated source rotated angularly in one of four directions by 2, 4, 6, or 8°, over two 1 s frames. Importantly, three-dimensional shape was necessary to disambiguate the changes, as some pyramids were concave and others convex, determining the flow of shading as the light source moved. This "yoked" stimulus design was used in order to address the only previous work on light field change, in which changes in luminance ratios between adjacent surfaces were predictive of surface color changes and not light changes, which preserve edge ratios in flat-world scenes (Foster and Nascimento, 1994). However, in three-dimensional scenes, edge ratios vary as a light source rotates about the scene; see Figure 13.14 for an illustration.
Figure 13.14 The effect of light field transformations on luminance edge ratios. (a) In a flat world with a homogeneous light field, an intensity or color change in the light field does not affect the ratio of luminance between two adjacent surfaces. (b) In a three-dimensional world in which the position of a point light source changes, the ratio of the luminances of adjacent surfaces changes.
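The point made in Figure 13.14 is easy to verify numerically. In the sketch below (matte surfaces, a clipped-cosine collimated term plus a small diffuse term, made-up albedos and orientations), scaling the light leaves the luminance ratio of two coplanar surfaces unchanged, whereas rotating the collimated source changes the ratio for two surfaces with different orientations.

import numpy as np

def luminance(albedo, normal_deg, light_deg, diffuse=0.25, collimated=1.0):
    """Matte-surface luminance: albedo times (diffuse + clipped-cosine) light."""
    angle = np.radians(normal_deg - light_deg)
    return albedo * (diffuse + collimated * max(0.0, np.cos(angle)))

albedos = (0.3, 0.6)                 # two adjacent surfaces

# Flat world: both surfaces face the same way, so any change in the light
# (here, doubling its intensity) leaves their luminance ratio untouched.
flat_before = [luminance(a, 0, 0) for a in albedos]
flat_after = [2 * L for L in flat_before]
print(flat_before[0] / flat_before[1], flat_after[0] / flat_after[1])   # equal

# 3D world: the surfaces have different orientations, so rotating the
# collimated source changes their luminance ratio with no albedo change.
normals = (-30, 30)
before = [luminance(a, n, light_deg=0) for a, n in zip(albedos, normals)]
after = [luminance(a, n, light_deg=20) for a, n in zip(albedos, normals)]
print(before[0] / before[1], after[0] / after[1])                       # differ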
For each light-transformation trial, we created a "yoked" nonlight trial, in which each pyramid's luminance values were retained in both frames, including the luminance ratios between adjacent edges, but the pyramids were rotated at random within the scene so that the resulting change in the scene was not consistent with any possible change in the direction of the collimated light source. In each trial, observers were first asked to judge whether the scene changes were consistent with a change in location of the collimated source or not. If they reported that the scene changes were consistent with a change in location of the collimated source, they were then asked to report the direction in which the light source had moved. At the lowest magnitude of light source movement, 2°, two of the four observers were above chance in discriminating the globally consistent light change from the inconsistent version, and at 4°, all four observers were well above chance in discriminating the changes, with discriminability increasing with the magnitude of the angular rotation. Importantly, all observers were above chance in reporting the direction of the lighting change even in trials with the smallest magnitude of the change in lighting direction. Discriminability measures are plotted in Figure 13.15, and results for the ability to detect the movement direction of the light are plotted in Figure 13.16. The observers excelled at discriminating the changes even at low magnitudes, indicating that the human visual system is sensitive to small light-field-induced changes that disturb luminance edge ratios. The nature of the discrimination in this experiment required processing of the three-dimensional structure of the scene. Given only one of the two images in a stereo pair, it was not possible to accurately predict the direction of movement of the light source. This result demonstrates that light field perception cannot be modeled by image-based computations on a single image and cannot be modeled by algorithms making use only of changes in edge ratios. In a second experiment, Gerhard and Maloney (2008) investigated whether an observer's ability to discriminate changes in the light field would aid the observer in detecting simultaneous changes in surface color. If, for example, the visual system can accurately estimate the changes in luminance due to a light field change, then it could potentially detect a further change in surface albedo more accurately. In half of the trials, chosen at random, Gerhard and Maloney added a surface albedo perturbation on top of the two global scene changes presented in their earlier work. Instead of detecting globally consistent or inconsistent light field changes, observers were asked to detect whether one facet of one pyramid at
Figure 13.15 Light field discriminability results from Gerhard et al. (2007). The discriminability of light field changes from matched nonlight changes, measured as d′ (Green and Swets, 1966), is plotted for each observer as a function of the degree of angular rotation that the collimated source underwent. The d′ for each level of change magnitude was calculated from response data for a set of trials, half of which contained globally consistent light changes and the other half of which were statistically matched trials that did not contain a global light field change. Nonparametric bootstrapping was used to obtain 95% confidence intervals, and the results indicate performance above chance for all observers at all levels except the lowest level, at which observers 1 and 3 were at chance. On the right is an 8° angle for reference; it is the largest magnitude of light source rotation tested. A color version of this figure can be found on the publisher's website (www.cambridge.org/9781107001756).
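For readers unfamiliar with the measures in this figure, the sketch below computes d′ from hypothetical response counts and attaches a nonparametric bootstrap confidence interval by resampling trials. The counts and the 0.5 correction for extreme rates are illustrative choices, not values from the study.

import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
z = NormalDist().inv_cdf          # inverse of the standard normal CDF

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Signal-detection d' = z(hit rate) - z(false-alarm rate), with a small
    correction so rates of 0 or 1 do not give infinite z-scores."""
    h = (hits + 0.5) / (hits + misses + 1.0)
    f = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return z(h) - z(f)

# Hypothetical responses for one observer at one rotation magnitude:
# 40 consistent-light-change trials and 40 yoked non-light trials.
signal = np.r_[np.ones(31), np.zeros(9)]    # 1 = responded "light change"
noise = np.r_[np.ones(12), np.zeros(28)]

def d_from(sig, noi):
    return d_prime(int(sig.sum()), int((1 - sig).sum()),
                   int(noi.sum()), int((1 - noi).sum()))

# Nonparametric bootstrap over trials gives a 95% confidence interval.
boot = [d_from(rng.choice(signal, signal.size), rng.choice(noise, noise.size))
        for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"d' = {d_from(signal, noise):.2f}, 95% bootstrap CI [{lo:.2f}, {hi:.2f}]")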
a random location in the scene had changed albedo. In half of the trials, one facet did increase or decrease in luminance at random. Trials were prepared in the same yoking fashion as before to preserve luminance edge ratio information, as well as all other low-order image statistics. If the visual system can more accurately detect surface changes simultaneously with a change in the position of the collimated light source than simultaneously with matched changes not consistent with a change in the position of the collimated source, then the d′ measures should be higher than in the previous experiment. The observers demonstrated an average benefit of 1.65 times higher albedo perturbation sensitivity when the global change could be perceived as light-field-induced. The improvement was significant (t = 3.27; p < 0.001). See Figure 13.17 for the perturbation sensitivity for each observer. These results
Figure 13.16 Light movement direction results from Gerhard et al. (2007). The percentage correct for each observer is plotted as a function of the degree of angular rotation of the light source. All observers were above chance, which was 25%, at discriminating the direction in which the light source had moved in trials in which they detected a true light source rotation.
suggest that the variability in the image luminances was effectively reduced when the observers perceived the global change as driven by a dynamic light field, and that the observers discounted the component of change due to the changing light field.
13.6 Conclusions
The world in which we live is three-dimensional. Claims about the usefulness of visual information should be based on performance in three-dimensional environments. Many researchers in color vision have limited their choice of experimental conditions to conditions that are very different from the world in which we live. Such studies have, consequently, yielded limited results. A fruitful alternative is to examine human color perception in three-dimensional scenes that contain veridical cues to the light field. In this chapter, we first reviewed recent work by researchers on the evaluation of surface color, lightness perception, and constancy in three-dimensional scenes with moderately complex lighting models, and we presented two recent studies in detail. The implication of this research is that the human visual system can compensate for spatially and spectrally inhomogeneous light fields. In the discussion, we found that performance is affected by the availability of
Figure 13.17 Detection of albedo change, from Gerhard and Maloney (2008). The detectability of an albedo perturbation, measured as d′ (Green and Swets, 1966), is plotted for each observer as a function of the level of albedo perturbation, expressed as a factor multiplying the original surface reflectance. Nonparametric bootstrapping was used to obtain 95% confidence intervals; arrows denote confidence intervals that include infinity, indicating that the task was trivial for some observers at some perturbation magnitudes. The open circles show the detectability of albedo perturbations in global light field change trials, and the filled circles are for image-statistic-matched nonlight trials. The average benefit for albedo perturbation detectability under a light field change was a 1.65-fold increase in d′. A color version of this figure can be found on the publisher's website (www.cambridge.org/9781107001756).
specular illuminant cues that signal the light field (Snyder et al., 2005; Boyaci et al., 2006b). The results of Snyder et al. are particularly interesting. The stimuli were presented binocularly and, if we view either of the binocular images in isolation, we see that the only change in the stimulus from trial to trial is a small shift of the test surface to the left or right against an otherwise constant background (Figure 13.8). If we were to attempt to explain the perceived lightness of the test surface in terms of its immediate surroundings, we would predict little or no trial-to-trial variation. Yet we find large changes in perceived lightness (Figure 13.9) as a function of depth. The binocular-disparity cues that lead to an altered perception of lightness are not present in either image alone. These results are consistent with those of Ikeda et al. (1998). Color perception
in three-dimensional scenes cannot be readily predicted given only the results of experiments on Mondrian scenes. We next described work by Doerschner et al. (2007) testing whether human observers can compensate for light fields generated by more than one collimated light source, and found that they could do so. Last, we looked at the human ability to estimate the light field and changes in the light field, and asked whether this ability benefited the ability to detect changes in surface albedo in scenes where the lighting is changing. It did. We emphasize that the human abilities demonstrated in these last experiments cannot be explained by algorithms limited to consideration of edge ratios and other local computations, or even to single images taken from a stereo pair. In the last part of the chapter, we described recent work directly assessing the human ability to estimate the light field (Koenderink et al., 2007) and whether humans can use estimates of changes in the light field to enhance their ability to detect concurrent changes in surface albedo (Gerhard et al., 2007; Gerhard and Maloney, 2008). Reframing the problem of illuminant estimation in terms of the combination of veridical cues to the light field opens up new and potentially valuable directions for research (Maloney, 1999). In this review, we have focused on surface color perception. It would also be of interest to see how human observers judge color relations (Foster and Nascimento, 1994; Nascimento and Foster, 2000) between surfaces free to differ in orientation and location in the kinds of scenes used in the experiments presented here. Equally, it would be of interest to assess how judgments of material properties (Fleming et al., 2003) vary in such scenes. From the broadest perspective, a full description of human color perception requires that we examine how the human visual system operates in fully three-dimensional scenes with adequate cues to the illuminant. Understanding human color perception in the case of the Mondrian singularity remains an extremely important research topic, and work in this area is contributing to our understanding of visual perception. The work described here serves to complement and extend the large existing body of literature on surface color perception.
Acknowledgments
This research was funded in part by Grant EY08266 from the National Eye Institute of the National Institutes of Health and by an award from the Humboldt Foundation. The initial part of the article is based on Boyaci et al. (2006b). The preliminary results from Gerhard and Maloney (2008) described here have now been published in final form as Gerhard and Maloney (2010).
L. T. Maloney, H. E. Gerhard, H. Boyaci, and K. Doerschner References Adelson, E. H. and Bergen, J. R. (1991). The plenoptic function and the elements of early vision. In M. S. Landy and J. A. Movshon (eds.), Computational Models of Visual Processing, pp. 3–20. Cambridge, MA: MIT Press. Adelson, E. H. and Pentland, A. P. (1996). The perception of shading and reflectance. In D. Knill and W. Richards (eds.), Perception as Bayesian Inference, pp. 409–423. New York: Cambridge University Press. Arend, L. E. and Spehar, B. (1993a), Lightness, brightness, and brightness contrast: 1. Illuminance variation. Percept. Psychophys., 54: 446–456. Arend, L. E. and Spehar, B. (1993b), Lightness, brightness, and brightness contrast: 2. Reflectance variation. Percept. Psychophys., 54: 457–468. Basri, R. and Jacobs, D. (2001). Lambertian reflectance and linear subspaces. In Proceedings of the IEEE International Conference on Computer Vision, Vancouver, pp. 383–390. Bäuml, K.-H. (1999). Simultaneous color constancy: how surface color perception varies with the illuminant. Vis. Res., 39: 1531–1550. Belhumeur, P. N., Kriegman, D., and Yuille, A. (1999). The bas-relief ambiguity. Int. J. Comput. Vis., 35: 33–44. Bloj, M. G., Kersten D., and Hurlbert, A. C. (1999). Perception of three-dimensional shape influences colour perception through mutual illumination. Nature, 402: 877–879. Bloj, M., Ripamonti, C., Mitha, K., Hauck, R., Greenwald, S., and Brainard, D. H. (2004). An equivalent illuminant model for the effect of surface slant on perceived lightness. J. Vis. 4(9): 735–746. Boyaci, H., Maloney, L. T., and Hersh, S. (2003). The effect of perceived surface orientation on perceived surface albedo in three-dimensional scenes. J. Vis., 3: 541–553. Boyaci, H., Doerschner, K., and Maloney, L. T. (2004). Perceived surface color in binocularly-viewed scenes with two light sources differing in chromaticity. J. Vis., 4: 664–679. Boyaci, H., Doerschner, K., and Maloney, L. T. (2006a). Cues to an equivalent lighting model. J. Vis., 6: 106–118. Boyaci, H., Doerschner, K., Snyder, J. L., and Maloney, L. T. (2006b). Surface color perception in three-dimensional scenes. Vis. Neurosci., 23: 311–321. Ciurea, F. and Funt, B. (2004). Failure of luminance-redness correlation for illuminant estimation. In Proceedings of the Twelfth Color Imaging Conference, Scottsdale, AZ, pp. 42–46. Doerschner, K., Boyaci, H., and Maloney, L. T. (2004). Human observers compensate for secondary illumination originating in nearby chromatic surfaces. J. Vis., 4: 92–105. Doerschner, K., Boyaci, H., and Maloney, L. T. (2007). Testing limits on matte surface color perception in three-dimensional scenes with complex light fields. Vis. Res., 47: 3409–3423.
Surface color perception and light field estimation in 3D scenes Dror, R. O., Willsky, A., and Adelson, E. H. (2004). Statistical characterization of real-world illumination. J. Vis., 4: 821–837. Epstein, W. (1961). Phenomenal orientation and perceived achromatic color. J. Psychol., 52: 51–53. Fleming, R. W., Dror, R. O., and Adelson, E. H. (2003). Real-world illumination and the perception of surface reflectance properties. J. Vis., 3: 347–368. Flock, H. R. and Freedberg, E. (1970), Perceived angle of incidence and achromatic surface color. Percept. Psychophys., 8: 251–256. Foster, D. H. and Nascimento, S. M. C. (1994). Relational colour constancy from invariant cone-excitation ratios. Proc. R. Soc. Lond. B, 257: 115–121. Gerhard, H. E. and Maloney, L. T. (2008). Albedo perturbation detection under illumination transformations: a dynamic analogue of lightness constancy. [Abstract.] J. Vis., 8: 289. Gerhard, H. E. and Maloney, L. T. (2010). Detection of light transformations and concomitant changes in surface albedo. J. Vis. 10: 1–14. Gerhard, H. E., Khan, R., and Maloney, L. T. (2007). Relational color constancy in the absence of ratio constancy. [Abstract.] J. Vis., 7: 459. Gershun, A. (1936/1939). Svetovoe Pole [The Light Field]. Moscow, 1936. Translated by P. Moon and G. Timoshenko (1939) in J. Math. Phys., 18: 51–151. Gilchrist, A. L. (1977). Perceived lightness depends on spatial arrangement. Science, 195: 185–187. Gilchrist, A. L. (1980). When does perceived lightness depend on perceived spatial arrangement? Percept. Psychophys., 28: 527–538. Gilchrist, A. L. and Annan, A., Jr. (2002). Articulation effects in lightness: historical background and theoretical implications. Perception, 31: 141–150. Gilchrist, A. L., Kossyfidis, C., Bonato F., Agostini, T., Cataliotti, J., Li, X. J., Spehar, B., Annan, V., and Economou, E. (1999). An anchoring theory of lightness perception. Psychol. Rev., 106: 795–834. Golz, J. and MacLeod, D. I. A. (2002). Influence of scene statistics on colour constancy. Nature, 415: 637–640. Granzier, J. J. M., Brenner, E., Cornelissen, F. W., and Smeets, J. B. J. (2005). Luminance–color correlation is not used to estimate the color of the illumination. J. Vis., 5: 20–27. Green, D. M. and Swets, J. A. (1966). Signal Detection Theory and Psychophysics. New York: Wiley. Hara, K., Nishino, K., and Ikeuchi, K. (2005). Light source position and reflectance estimation from a single view without the distant illumination assumption. IEEE Trans. Pattern Anal. Machine Intell., 27: 493–505. Haralick, R. M. and Shapiro, L. G. (1993). Computer and Robot Vision, Vol. 2. Reading, MA: Addison-Wesley. Henderson, S. T. (1977). Daylight and Its Spectrum, 2nd edn. Bristol: Adam Hilger. Hochberg, J. E. and Beck, J. (1954). Apparent spatial arrangements and perceived brightness. J. Exp. Psychol., 47: 263–266.
L. T. Maloney, H. E. Gerhard, H. Boyaci, and K. Doerschner Hurlbert, A. C. (1998). Computational models of colour constancy. In V. Walsh and J. Kulikowski (eds.), Perceptual Constancy: Why Things Look as They Do, pp. 283–322, Cambridge: Cambridge University Press. Ikeda, M., Shinoda, H., and Mizokami, Y. (1998). Three dimensionality of the recognized visual space of illumination proved by hidden illumination, Opt. Rev., 5: 200–205. Kaiser, P. K. and Boynton. R. M. (1996). Human Color Vision, 2nd edn. Washington, DC: Optical Society of America. Kardos, L. (1934). Ding und Schatten; Eine experimentelle Untersuchung über die Grundlagen des Farbsehens. Z. Psychol. Physiol. Sinnesorgane, Ergänzungsband, 23. Leipzig, Germany: Verlag von J. A. Barth. Katz, D. (1935). The World of Colour. London: Kegan, Paul, Trench, Trubner and Co. Koenderink, J. J. and van Doorn, A. J. (1996). Illuminance texture due to surface mesostructure. J. Opt. Soc. Am. A, 13: 452–463. Koenderink, J. J., van Doorn, A. J., and Pont S. C (2004). Light direction from shad(ow)ed random Gaussian surfaces. Perception, 33: 1403–1404. Special issue, Shadows and Illumination II. Koenderink, J. J., Pont S. C., van Doorn, A. J., Kappers, A. M. L., and Todd J. T. (2007). The visual light field. Perception, 36: 1595–1610. Land, E. H. and McCann, J. J. (1971). Lightness and retinex theory. J. Opt. Soc. Am., 61: 1–11. Lee, R. L., Jr. and Hernández-Andrés, J. (2005a). Short-term variability of overcast brightness. Appl. Opt., 44: 5704–5711. Lee, R. L., Jr. and Hernández-Andrés, J. (2005b). Colors of the daytime overcast sky. Appl. Opt., 44: 5712–5722. MacLeod, D. I. A. and Golz, J. (2003). A computational analysis of colour constancy. In R. Mausfeld and D. Heyer (eds.), Colour Perception: Mind and the Physical World, pp. 205–242. Oxford: Oxford University Press. Maloney, L. T. (1999). Physics-based approaches to modeling surface color perception. In K. R. Gegenfurtner and L. T. Sharpe (eds.), Color Vision: From Genes to Perception, pp. 387–422. Cambridge: Cambridge University Press. Maloney, L. T., Boyaci, H., and Doerschner, K. (2005). Surface color perception as an inverse problem in biological vision. Proc. SPIE, 5674: 15–26. Mausfeld, R. and Andres, J. (2002). Second-order statistics of colour codes modulate transformations that effectuate varying degrees of scene invariance and illumination invariance. Perception, 31: 209–224. Nascimento, S. M. C. and Foster, D. H. (2000). Relational color constancy in achromatic and isoluminant images. J. Opt. Soc. Am. A, 17: 225–231. Oruc, I., Maloney, L. T., and Landy, M. S. (2003). Weighted linear cue combination with possibly correlated error. Vis. Res., 43: 2451–2458. Pont, S. C. and Koenderink, J. J. (2003). Illuminance flow. In N. Petkov and M. A. Wetsenberg (eds.), Computer Analysis of Images and Patterns, pp. 90–97. Berlin: Springer.
Surface color perception and light field estimation in 3D scenes Pont, S. C. and Koenderink, J. J. (2004). Surface illuminance flow. In Y. Aloimonos and G. Taubin (eds.), Proceedings of the second International symposium on SD Data Processing Visualization and Transmission, Thessaloniki, Greece. Piscataway, NJ: IEEE. Ramamoorthi, R. and Hanrahan, P. (2001a). On the relationship between radiance and irradiance: determining the illumination from images of a convex Lambertian object. J. Opt. Soc. Am. A, 18: 2448–2458. Ramamoorthi, R. and Hanrahan, P. (2001b). An efficient representation for irradiance environment maps. In Proceedings of SIGGRAPH 2001, Los Angeles, CA, pp. 497–500. New York: ACM Press. Redding, G. M. and Lester, C. F. (1980). Achromatic color matching as a function of apparent test orientation, test and background luminance, and lightness or brightness instructions. Percept. Psychophys., 27: 557–563. Ripamonti, C., Bloj, M., Hauck, R., Kiran, K., Greenwald, S., Maloney, S. I., and Brainard, D. H. (2004). Measurements of the effect of surface slant on perceived lightness. J. Vis., 4: 747–763. Snyder, J. L., Doerschner, K., and Maloney, L. T. (2005), Illumination estimation in three-dimensional scenes with and without specular cues. J. Vis., 5: 863–877. te Pas, S. F. and Pont S. C. (2005). Comparison of material and illumination discrimination performance for real rough, real smooth and computer generated smooth spheres. In Proceedings of the 2nd Symposium on Applied Perception in Graphics and Visualization, A Coruña, Spain, Vol. 95, pp. 75–81. New York: ACM Press. Yang, J. N. and Maloney, L. T. (2001). Illuminant cues in surface color perception: tests of three candidate cues. Vis. Res., 41: 2581–2600.
14 Representing, perceiving, and remembering the shape of visual space
Aude Oliva, Soojin Park, and Talia Konkle
14.1 Introduction
Our ability to recognize the current environment determines our ability to act strategically, for example when selecting a route for walking, anticipating where objects are likely to appear, and knowing what behaviors are appropriate in a particular context. Whereas objects are typically entities that we act upon, environments are entities that we act within or navigate towards: they extend in space and encompass the observer. Because of this, we often acquire information about our surroundings by moving our head and eyes, getting at each instant a different snapshot or view of the world. Perceived snapshots are integrated with the memory of what has just been seen (Hochberg, 1986; Hollingworth and Henderson, 2004; Irwin et al., 1990; Oliva et al., 2004; Park and Chun, 2009), and with what has been stored over a lifetime of visual experience with the world. In this chapter, we review studies in the behavioral, computational, and cognitive neuroscience domains that describe the role of the shape of the space in human visual perception. In other words, how do people perceive, represent, and remember the size, geometric structure, and shape features of visual scenes? One important caveat is that we typically experience space in a three-dimensional physical world, but we often study our perception of space through two-dimensional pictures. While there are likely to be important differences between the perception of space in the world and the perception of space
mediated through pictures, we choose to describe in this chapter principles that are likely to apply to both media. In the following sections, we begin by describing how the properties of space can be formalized, and to what extent they influence the function and meaning of a scene. Next, we describe cases in which the perception of the geometry of space is distorted by low- and high-level influences. Then, we review studies that have examined how memory of scenes and of position in space is transformed. Finally, we address how people get a sense of the space just beyond the view they perceive, with a review of studies on scene integration. The visual perception of space is, first and foremost, observer-centered: the observer stands at a specific location in space, determined by latitude, longitude, and height coordinates. A view or viewpoint is a cone of visible space as seen from an observer's vantage point: a view is oriented (e.g., looking up, straight ahead, or down) and has an aperture that the dioptrics of the eyes suggest covers up to 180°. However, the apparent visual field that human observers visually experience is closer to 90°, corresponding to a hemisphere of space in front of them (Koenderink et al., 2009; Pirenne, 1970). These truncated views are the inputs provided to the brain. All ensuing spatial concepts such as scenes, places, environments, routes, and maps are constructed out of successive views of the world. In this chapter, we introduce two levels of description of environmental spaces: a structural level and a semantic level. The terms space, isovist, and spatial envelope refer to the geometric context of the physical world (structural level); scenes, places, and environments rely on understanding the meaning of the space that the observer is looking at or embedded in (semantic level). Whereas space is defined in physics as the opposite of mass, in our structural-level description we define a space as an entity composed of two substances: mass and holes. A space can be of any physical size in the world, for example, 1 m³ or 1000 m³. The spatial arrangement of mass and holes is the most simplified version of the three-dimensional layout of the space. From a given viewpoint, the observer has access to a collection of visible surfaces between the holes. The set of surfaces visible from that location if the observer rotates through 360° is called an isovist (Benedikt, 1979). An example of an isovist is shown in Figure 14.1. A collection of all isovists visible from all possible locations in a space defines a complete isovist map of the space. One final structural-level description of the spatial layout, as seen from one viewpoint, is the spatial-envelope representation (Oliva and Torralba, 2001). Here, three-dimensional spatial layouts correspond to two-dimensional projections that can be described by a statistical representation of the image features. This statistical
Figure 14.1 Two single isovist views are shown, with the dot marking the location from which the isovist was generated. The rightmost image shows the isovist map, which is the collection of isovists generated from all possible locations within the space. A color version of this figure can be found on the publisher's website (www.cambridge.org/9781107001756).
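A two-dimensional isovist of the kind shown in Figure 14.1 can be approximated by casting rays outward from a location on a floor plan until they hit a wall. The sketch below does this on a toy occupancy grid; it is a simplified stand-in for Benedikt's construction, and the floor plan, ray count, and step size are arbitrary choices made for illustration.

import numpy as np

def isovist_area(occupancy, x0, y0, n_rays=720, max_range=1000.0):
    """Approximate the 2D isovist at (x0, y0) on a boolean floor plan.

    occupancy[y, x] is True where there is a wall (mass) and False where the
    space is open.  Rays are marched outward until they hit a wall or leave
    the grid; the visible region's area is estimated from the radial distances.
    """
    h, w = occupancy.shape
    angles = np.linspace(0.0, 2.0 * np.pi, n_rays, endpoint=False)
    radii = np.empty(n_rays)
    for i, a in enumerate(angles):
        r, dx, dy = 0.0, np.cos(a), np.sin(a)
        while r < max_range:
            x, y = int(round(x0 + r * dx)), int(round(y0 + r * dy))
            if not (0 <= x < w and 0 <= y < h) or occupancy[y, x]:
                break
            r += 0.5
        radii[i] = r
    # Area of the star-shaped visible region: sum of thin triangular sectors.
    return 0.5 * np.sum(radii**2) * (2.0 * np.pi / n_rays)

# Toy floor plan: a 40 x 60 space with an internal wall that has a doorway.
plan = np.zeros((40, 60), dtype=bool)
plan[:, 30] = True        # internal wall
plan[18:22, 30] = False   # doorway
print(isovist_area(plan, x0=10, y0=20))   # isovist of a point west of the wall

The same radial distances also yield the isovist measures discussed below (perimeter, radial variance and skewness, and jaggedness), and repeating the computation at every open location produces an isovist map.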
representation can describe coarsely the shape, size, boundary, and content of the space in view. At the semantic level, scene, place, and environment are terms that refer to the meaning of the physical or pictorial world and are modulated by the knowledge of the observer. In the world, the observer is embedded in a space of a given place: a place is associated with certain actions and knowledge about a specific physical space (e.g., my kitchen or the White House) or groups of physical spaces (e.g., the category of industrial kitchens or of gymnasiums).1 The term “scene” has two common usages in the literature, as both a particular view and an extended space. Here we define a scene as a view (or cone of visible space) with an associated semantic meaning. A scene has a “gist” (Friedman, 1979; Oliva, 2005; Potter, 1976), namely a semantic description that comes with associated knowledge (e.g., a kitchen is a place for cooking). A scene depends on one’s view of a space (unlike places, which do not depend on the viewpoint of the observer). Therefore, a place can be composed of one or many scenes: by moving his or her head or moving around a city block, an observer may perceive a shop front, a parking lot, a street, and a park as different scenes. Places and scenes can be conceptualized as part of a larger topology, an environment. Environments 1
It is important to note that the word “place” has acquired different definitions depending on the domain of study. For instance, place has been used interchangeably with scene in cognitive neuroscience (e.g., Epstein, 2008) when referring to the parahippocampal place area, or PPA. In neuroscience, the term refers to place-cells, which are hippocampus neurons that fire when an animal is at a particular three-dimensional location (e.g., O’Keefe, 1979). Place-cells are specified by latitude, longitude, and height coordinates, and can also be oriented, for example pointing north with a 30◦ downward angle.
would therefore typically refer to physical spaces encompassing one or more scenes and places, typically of a larger scale than a single place.
14.2 Representing the shape of a space
In the following sections, we describe two representations of the structure of space: the isovist representation (Benedikt, 1979) and the spatial-envelope representation (Oliva and Torralba, 2001). Both offer a formal quantitative description of how to represent a space, i.e., the volumetric structure of a scene, place, or environment. The isovist description operates over a three-dimensional model of the environment, and captures information about the distribution and arrangement of visible surfaces. The spatial-envelope description operates over projections of space onto two-dimensional views, and captures information about both the layout and the texture of surfaces.
14.2.1 Isovist representation
Figure 14.1 illustrates the isovist of a laboratory space for a given position in the center of the main room. An isovist represents the volume of space visible from a specific location, as if illuminated by a source of light at this position. As such, the isovist is observer-centered but viewpoint-independent. It represents the visible regions of space, or the shape of the place, at a given location, obtained from the observer rotating through 360◦ . A concept initially introduced by Tandy (1967), the isovist was formalized by Benedikt (1979). Although Benedikt described an isovist as the volume visible from a given location, in a view-independent fashion, the concept can be simplified by considering a horizontal slice of the “isovist polyhedron” as illustrated by the single isovists shown in Figure 14.1. The volumetric configuration of a place requires calculating a collection of isovists at various locations: this refers to the isovist field or isovist map (Benedikt, 1979; Davis and Benedikt, 1979), shown in Figure 14.1 on the right. High luminance levels indicate areas that can be seen from most of the locations in the main central room of the laboratory and dark areas indicate regions that are hidden from most of the locations. In empty, convex rooms (such as a circular, square, or rectangular room), the isovist field is homogeneous, as every isovist from every location has the same shape and volume (or the same area if a two-dimensional floor plan is considered). The shape of an isovist can be characterized by a set of geometrical measurements (Benedikt, 1979; Benedikt and Burnham, 1985): its area, corresponding to how much space can be seen from a given location; its perimeter length, which
measures how many surfaces² can be seen from that location; its variance, which describes the degree of dispersion of the perimeter relative to the original location; and its skewness, which describes the asymmetry of this dispersion. All of these inform the degree to which the isovist polygon is dispersed or compact. Additional quantitative measurements of isovists have included the number of vertices (i.e., the intersections of the outlines of the isovist polygon) and the openness of the polygon. The openness of an isovist is calculated as the ratio between the length of open edges (generated by occlusions) and the length of closed edges (defined by solid visible boundaries) (Psarra and Gradjewski, 2001; Wiener and Franz, 2005). From simple geometrical measurements of isovists and isovist maps, higher-level properties of the space can be derived: its occlusivity (i.e., the depth to which surfaces inside the space overlap with each other; Benedikt, 1979),³ its degree of compactness (a measure defined by a circle whose radius is equal to the mean radial length of the isovist, which indicates how much the isovist's shape resembles a circle), its degree of spaciousness (Stamps, 2005), and its degree of convexity (also referred to as jaggedness, calculated as the ratio between the squared perimeter of the isovist and its area; see Wiener and Franz (2005) and Turner et al. (2001)). A concave or "jagged" isovist has dents, which means that regions of the place are hidden from view. A circular, convex isovist has no hidden regions. Our understanding of the relationship between geometrical measurements of the isovists and the perception of a scene and a place remains in its infancy. Wiener and Franz (2005) found that the degree of convexity and the openness ratio of isovists correlated with observers' judgments of the complexity of a space, which, in turn, modulates navigation performance in a virtual reality environment. Simple isovist descriptors (area, occluded perimeter, variance, and skewness) predict people's impressions of the spaciousness of hotel lobbies (Benedikt and Burnham, 1985) and the degree of perceived enclosure of a room or an urban place (Stamps, 2005). Potentially, the perceptual and cognitive factors correlated with isovists and their configuration may be diagnostic of a given type of place or of the function of a space. Furthermore, behavior in a space may be predicted by these structural spatial descriptors. Along these
3
In his 1979 paper, Benedikt defined a visible real surface as an “opaque, material, visible surface” able to scatter visible light. This disqualifies the sky, glass, mirrors, mist, and “perfectly black surfaces.” Opaque boundaries are barriers that impede vision beyond them. Occlusivity measures “the length of the nonvisible radial components separating the visible space from the space one cannot see from the original location X” (Benedikt, 1979).
Representing, perceiving, and remembering the shape of visual space lines, Turner and colleagues were able to predict complex social behaviors such as way-finding and the movement of a crowd in a complex environment (Turner et al., 2001). An analysis of the kinds of space that a human being encounters, and of the geometrical properties that distinguish different kinds of space from each other remains to be done. Furthermore, in its original form, the isovist theory does not account for the types of textures, materials, or colors attached to the surfaces, and this information will likely be important for relating structural descriptions of spaces to human perceptions of spaces or actions within spaces. However, the isovist description does provide a global geometrical analysis of the spatial environment and gives mathematical descriptions to spatial terms such as “vista,” “panorama,” and “enclosure,” which in turn allows us to formalize and predict spatial behaviors of human, animal, and artificial systems. In the next section, we describe another formal approach for describing the shape of a space, the spatial-envelope representation. 14.2.2
14.2.2 Spatial-envelope representation
Given that we experience a three-dimensional world, it makes sense that we have learned to associate the meaning of a scene with properties that are diagnostic of the spatial layout and geometry, as well as with the objects in view (e.g., while closets typically contain clothes, and gyms typically contain exercise equipment, it is also the case that closets are typically very small places and gyms are large places). In architecture, the term “spatial envelope” refers to a description of a whole space that provides an “instant impression of the volume of a room or an urban site” (Michel, 1996). The concept has been used to describe qualitatively the character and mood of a physical or pictorial space, represented by its boundaries (e.g., walls, floor, ceiling, and lighting) stripped of movable elements (e.g., objects and furnishing). In 2001, Oliva and Torralba extended this concept and proposed a formal, computational approach to the capture of the shape of space as it would be perceived from an observer’s vantage point (Oliva and Torralba, 2001, 2002, 2006, 2007; Torralba and Oliva, 2002, 2003). The collection of properties describing a space in view is referred to as the spatial-envelope representation. For instance, just as a face can be described by attributes such as its size, gender, age, symmetry, emotion, attractiveness, skin type, or configuration of facial features, a space can be described by a collection of properties such as perspective, size, dominant depth, openness, and naturalness of content. To give an example of these scene properties, a space can be represented by two independent descriptors, one representing the boundaries or external
Figure 14.2 A schematic illustration of how pictures of real-world scenes can be defined uniquely by their spatial layout and content. Note that the configuration, size, and locations of components can be in correspondence between natural and manufactured environments. If we strip off the natural content of a forest, keeping the enclosed spatial layout, and fill the space with urban content, then the scene becomes an urban street scene. If we strip off the natural content of a field, keeping the open spatial layout, and fill the space with urban content, then the scene becomes an urban parking lot. A color version of this figure can be found on the publisher’s website (www.cambridge.org/978107001756).
features, and one representing the content or internal features (Oliva and Torralba, 2001, 2002, 2006; Park et al., 2011). Boundaries and content descriptors are orthogonal properties: a space can be of various sizes and shapes, and it can have any content in terms of parts, textures, colors, and materials. Figure 14.2 illustrates this point: a space can have either a closed or an open layout of a particular shape (the enclosed layout here is in perspective, and the open layout has a central figure), with its surface boundaries “painted” with either natural or manufactured content. Oliva and Torralba (2001) discovered that some of the key properties of the spatial envelope (e.g., mean depth, openness, perspective, naturalness, and roughness) have a direct transposition into visual features of two-dimensional surfaces. This allows the calculation of the degree of openness, perspective, mean depth, or naturalness of a scene by capturing the distribution of local image features and determining the visual elements (oriented contours, spatial frequencies, and spatial resolution) diagnostic of a particular spatial layout (Oliva and Torralba, 2001; Torralba and Oliva, 2002; Ross and Oliva, 2010). This statistical representation of the spatial distribution of local image features is compressed relative to the original image. To visualize what information is contained in this spatial-envelope representation, sketch images are shown in Figure 14.3 below the original image, where random noise was coerced to have
Figure 14.3 Top: examples of natural-scene images with different degrees of mean depth (from small to large volume). Bottom: a sketch representation of the visual features captured with the spatial-envelope representation (see Oliva and Torralba (2001) and (2006) for details). Note that this representation of a natural scene has no explicit coding of objects or segmented regions. A color version of this figure can be found on the publisher’s website (www.cambridge.org/978107001756).
the same statistical representation as the original image (see Oliva and Torralba (2006) for details). A summary of the framework of the spatial-envelope model is shown in Figure 14.4. For simplicity, the model is presented here as a combination of four global scene properties (Figure 14.4a). The implementation of the model takes the form of high-level image filters originating from the outputs of local oriented filters, as in the early visual areas of the brain (Figure 14.4b). Within this framework, the structure of a scene is characterized by the properties of the boundaries of the space (e.g., the size of the space, its degree of openness, and the perspective) and the properties of its content (e.g., the style of the surfaces, whether the scene is natural or artificial, the roughness of these surfaces, the level of clutter, and the type of materials). Any scene image can be described by the values it takes for each spatial-envelope property. These values can then be represented by terms that describe, for instance, the degree of openness of a given scene (“very open/panoramic,” “open,” “closed,” or “very closed/enclosed”; Oliva and Torralba, 2002). In this framework, instead of a forest being described as an environment with trees, bushes, and leaves, it would be described at an intermediate level as “a natural enclosed environment with a dense, isotropic texture.” Similarly, a specific image of a street scene could be described as an “artificial outdoor place with perspective and a medium level of clutter.”
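To make the idea of high-level image filters built from the outputs of local oriented filters concrete, here is a minimal sketch in the spirit of this approach. It is not the authors' implementation: the filter parameters, the number of orientations and scales, and the pooling grid are illustrative choices, and a real spatial-envelope model additionally learns how such pooled features map onto properties such as openness or mean depth.

```python
# Oriented filter energy pooled over a coarse grid: a GIST-like layout descriptor.
import numpy as np
from scipy.ndimage import convolve

def gabor_kernel(sigma, theta, wavelength, size=21):
    """Real (even) Gabor filter: an oriented contour detector at one scale."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)          # coordinate along the orientation
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    kernel = envelope * np.cos(2.0 * np.pi * xr / wavelength)
    return kernel - kernel.mean()                       # zero mean: flat regions give no response

def spatial_envelope_features(img, n_orient=4, scales=((2, 6), (4, 12)), grid=4):
    """Average oriented filter energy within each cell of a grid x grid layout."""
    img = np.asarray(img, dtype=float)
    features = []
    for sigma, wavelength in scales:
        for i in range(n_orient):
            energy = np.abs(convolve(img, gabor_kernel(sigma, np.pi * i / n_orient, wavelength)))
            for band in np.array_split(energy, grid, axis=0):
                for cell in np.array_split(band, grid, axis=1):
                    features.append(cell.mean())
    return np.array(features)                           # len(scales) * n_orient * grid**2 values

# Example with a random array standing in for a grayscale photograph.
print(spatial_envelope_features(np.random.rand(128, 128)).shape)   # (128,)
```

Images whose descriptors lie close together under this kind of pooling tend to share a coarse layout, which is the property exploited in the scene space described below.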
Figure 14.4 Schematic spatial-envelope model. (a) Spatial-envelope properties can be classified into spatial and content properties. (b) Illustration of a computational-neuroscience implementation of the spatial-envelope model. The features of naturalness and openness are illustrated here. (c) Projection of pictures of artificial environments onto three spatial-envelope dimensions, creating a scene space (based on global properties only, no representation of objects here). Semantic categories (different colors) emerge, showing that the spatial-envelope representation carries information about the semantic class of a scene. Two target images, together with their nearest neighbors in the spatial-envelope space, are shown here (from a dense database of images). A color version of this figure can be found on the publisher’s website (www.cambridge.org/978107001756).
This level of description is meaningful to observers who can infer the probable semantic category of the scene. Indeed, Oliva and Torralba observed that scene images judged by people to have the same categorical membership (street, highway, forest, coastline, etc.) were projected close together in a multidimensional space whose axes correspond to the spatial-envelope dimensions (Figure 14.4c). Neighboring images in the spatial-envelope space correspond to images with a similar spatial layout and a similar semantic description (more so when the space is filled densely, i.e., either with a lot of varied exemplars or with typical exemplars of categories). As shown in these sections, both the isovist and the spatial-envelope representations provide many interesting and complementary descriptors of the shape of the space that are quantitatively defined. The isovist describes the visible volumes of a three-dimensional space, while the spatial envelope captures layout and content features from a two-dimensional projected view. In these theories, space is a material entity as important as any other surface, such as wood, glass, or rock. Space has a shape with external and internal parts that can be represented by algorithms and quantitative measurements, some of which are very similar to operations likely to be implemented in the brain. These approaches constitute different instances of a space-centered understanding of the world, as opposed to an object-centered approach (Barnard and Forsyth, 2001; Carson et al., 2002; Marr, 1982).
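As a toy illustration of such a scene space (with made-up data rather than the authors' image database), a nearest-neighbor lookup in the descriptor space is already enough to propagate a semantic label: if neighbors share a spatial layout, they tend to share a category.

```python
# Majority-vote category from the k nearest scenes in a descriptor space.
import numpy as np
from collections import Counter

def nearest_neighbor_category(query, descriptors, labels, k=5):
    dists = np.linalg.norm(descriptors - query, axis=1)   # Euclidean distance in scene space
    nearest = np.argsort(dists)[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Hypothetical database: two scene categories that cluster in descriptor space.
rng = np.random.default_rng(1)
forests = rng.normal(loc=-1.0, size=(100, 128))
streets = rng.normal(loc=+1.0, size=(100, 128))
descriptors = np.vstack([forests, streets])
labels = ["forest"] * 100 + ["street"] * 100

query = rng.normal(loc=-1.0, size=128)                    # a new forest-like descriptor
print(nearest_neighbor_category(query, descriptors, labels))   # prints "forest"
```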
14.3 Perceiving the shape of a space
Numerous studies have shown that our perception of space is not veridical: it can be distorted by a number of factors. Some factors are basic constraints arising from visual-field resolution and the challenge of recovering the three-dimensional structure from a two-dimensional projection on the retinas. Other factors go beyond simple optics and include top-down effects of knowledge, as well as markers reflecting our physiological state. Finally, systematic distortions can arise as a consequence of perceptual dynamics as we adapt to the volumetric properties of the space around us. In this section, we will focus our review on distortions that change our global perception of the overall shape, volume, distances, or slants of a space, rather than our local perception centered on objects or parts.
14.3.1 Distortion of the geometry of a space
There are many ways in which our perception of space is not veridical: for example, distances in the frontal plane (i.e., traversing from left to right) appear much larger than distances in the sagittal plane (i.e., receding in depth
from the observer) (Wagner, 1985; Loomis et al., 1992), while distances in the frontal plane appear much smaller than vertical distances (e.g., Higashiyama, 1996; Yang et al., 1999). Surface angles are often underestimated and slants of hills are often overestimated (Proffitt et al., 1995; Creem-Regehr et al., 2004). Distances to objects can be misperceived when a relatively wide expanse of the ground surface is not visible (Wu et al., 2004), or when the field of view is too narrow (Fortenbaugh et al., 2007). Such biases are also highly dependent on the structure of the scene: distance judgments are most difficult and inaccurate in a corridor; they are easy, accurate, and reliable out in an open field; and they are easy and reliable but inaccurate in a lobby (Lappin et al., 2006). Many visual illusions, for example the Ames room, take advantage of different depth cues to change the perception of the size of objects and the size of a space. The rules for the distortion of the perception of physical space have been well documented (for a review, see Cutting (2003)): as physical distance increases, perceived distances are foreshortened as compared with physical space (Loomis and Philbeck, 1999). This means that observers do not accurately evaluate distances between objects at far distances, being only able to judge ordinal relations (which surface is in front of another, but not by how much). The compression of planes in a space with distance of viewing is likely to be due to the decrease in the available information and depth cues (Indow, 1991; see Cutting, 2003). In his 2003 review, Cutting distinguishes three ecological ranges of space perception. First, perception in the personal space (up to about 2–3 m) is metric: indeed, veridical spatial computation is necessary for accurate hand reaching and grasping. In close-up space, distances to objects and surfaces are provided by many sources of information and depth cues, including accommodation and convergence, that cease to be effective beyond a few meters (Loomis et al., 1996). Second, the action space is defined in practice by the distance to which one can throw an object accurately (up to about 30 m away). Whereas depth perception in the action space suffers some compression, studies have found it to be close to physical space. Beyond a few tens of meters is vista space, where an observer’s perception of distances to and between surfaces can become greatly inaccurate, with a dramatically accelerating foreshortening of space perception for distances over 100 m. At that range, only traditional pictorial cues remain in effect (e.g., occlusion, relative size, aerial perspective, height in the visual field, and relative density; see Cutting, 2003). Observers rely on their knowledge of the relative sizes of objects, and ordinal cues such as layout arrangement and occlusion, to infer the shape of the three-dimensional space. Furthermore, these drastic spatial compressions of vista space are not noticed by individuals (Cutting and Vishton, 1995; Cutting, 2003).
Figure 14.5 Examples of natural images in which inverting the images creates a plausible scene view with dramatic changes in the interpretation of surfaces and volume between upright and upside down (adapted from Torralba and Oliva, 2003). A color version of this figure can be found on the publisher’s website (www.cambridge.org/978107001756).
This gradient of the perceived compression of space suggests the need for a perceptual isovist, where the characteristics of the shape of visible surfaces are measured not from actual distances in a volume but from perceived distances (see Section 14.2.1). For example, we would expect a more deformed isovist for background than for foreground planes, since perception is based on ordinal estimations for the background planes. When only ordinal depth information about planes is available, some illusions of scene volume and misinterpretation of surfaces may occur. Figure 14.5 illustrates these illusions using photographs of natural scenes: the “mountain cliff” and the picture of a “river receding into the distance” (Figures 14.5a,c) are perceived as “the base of a mountain” and “a view looking up at the sky” (Figures 14.5b,d), respectively, when the images are inverted. Here, the image inversion has two main effects: it reverses lighting effects, which may change the surfaces’ affiliation as “object” and “ground,” and in some cases it produces large changes in the perceived scale of the space. The spatial-envelope approach (Oliva and Torralba, 2001; Torralba and Oliva, 2002) captures the low-level and texture statistics which are correlated with the change of perceived scale and semantics.
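A perceptual isovist of this kind could be sketched, speculatively, by remeasuring the same isovist after compressing its radial distances. The power-law mapping below is purely illustrative (the chapter does not commit to a particular compression function); it simply shows that far portions of the visible volume shrink much more than near ones, so the perceived isovist is more compact than the physical one.

```python
# "Perceptual isovist" toy: compress radial distances before measuring the polygon.
import numpy as np

def polygon_area(radii):
    angles = np.linspace(0.0, 2.0 * np.pi, len(radii), endpoint=False)
    x, y = radii * np.cos(angles), radii * np.sin(angles)
    return 0.5 * np.abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def perceived(radii, exponent=0.7):
    """Assumed compressive mapping from physical to perceived distance."""
    return np.asarray(radii, dtype=float) ** exponent

physical = np.concatenate([np.full(180, 3.0), np.full(180, 60.0)])   # near wall, far vista
print(polygon_area(physical), polygon_area(perceived(physical)))
# The far half of the boundary is pulled in far more than the near half, so
# descriptors such as area and radial variance shrink disproportionately.
```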
14.3.2 Changing the volume of a space
The tilt–shift illusion is another scene depth illusion where a small change in the levels of blur across an image can make an expansive scene look miniature (Figure 14.6). The degree of focus across a scene is a simple low-level depth cue: as you fixate on more distant points in space, the angle between your eyes narrows (convergence) and the focus of the lens relaxes (accommodation), which influences the retinal blur gradient (Watt et al., 2005; Held and Banks, 2008). For example, focusing on an object very close in front of you will lead to a situation where only a small portion of
Figure 14.6 Two examples of the tilt–shift illusion, where adding a blur gradient to the upper and lower portions of an image makes the scene appear miniature. A color version of this figure can be found on the publisher’s website (www.cambridge.org/978107001756).
the image can be in focus, with the upper and lower parts of the scene blurred. Thus expansive scenes can be made to look small by adding blur. This effect works best for scenes which are taken from high above, mimicking the angle of view one would have if looking at a toy model. In other words, the tilt–shift effect works by changing the low-level statistics (the blur and angle of surfaces caused by an elevated head angle) to influence the perceived volume of a space. While the tilt–shift makes a large scene look small, the converse is also possible. Making small scenes look large is a trick that has been honed to an art by Hollywood special effects artists. The original special effects took advantage of two-dimensional projection rules, in a technique called “forced perspective.” For example, suppose cars are traveling across a bridge, with a camera filming the scene from afar. By putting a model version of the bridge much closer to the camera, the real bridge and the model bridge can be set to project to the same two-dimensional image, allowing a dramatic explosion of the model bridge to look real. These are examples of depth illusions, where the volume of the space changes based on the cues in the environment and on our expectations about the structure and statistics of the natural world. Indeed, neuroimaging work has shown that the size that you think something is in the world matters beyond just the visual angle at which it projects onto the retina. Murray et al. (2006) presented observers with two disks with matched visual angles on the screen, but with contextual information that made one disk look much larger (and farther away) than the other. The bigger disk activated a greater extent of the primary
visual cortex than the smaller disk did, despite their equivalent visual size. These results suggest that the perceived physical size of an object or space has consequences for very early stages of visual processing.
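The blur manipulation behind the tilt–shift effect described above is easy to approximate in a few lines. The sketch below is not from the chapter: it blends an image with a single blurred copy through a vertical mask, so that blur grows with distance from a horizontal "in-focus" band (a graded blur would be more faithful, but the miniaturization illusion already appears with this shortcut). File names and parameter values are illustrative.

```python
# Fake tilt-shift: blend a sharp and a blurred copy with a vertical focus mask.
import numpy as np
from PIL import Image, ImageFilter

def tilt_shift(path, focus_center=0.55, focus_width=0.15, max_blur=6.0):
    sharp = Image.open(path).convert("RGB")
    blurred = sharp.filter(ImageFilter.GaussianBlur(max_blur))

    w, h = sharp.size
    rows = np.linspace(0.0, 1.0, h)                       # vertical position, 0 = top
    # Weight of the blurred copy: 0 inside the focus band, ramping up to 1 away from it.
    dist = np.clip(np.abs(rows - focus_center) - focus_width / 2.0, 0.0, None)
    weight = np.clip(dist / (2.0 * focus_width), 0.0, 1.0)
    mask = Image.fromarray(np.tile((weight * 255).astype(np.uint8)[:, None], (1, w)), mode="L")

    # Where the mask is high the blurred copy dominates, giving the miniature look.
    return Image.composite(blurred, sharp, mask)

if __name__ == "__main__":
    tilt_shift("aerial_scene.jpg").save("aerial_scene_tiltshift.jpg")
```

As noted above, the result is most convincing for photographs taken from high above, because the added blur gradient then matches what the eye would receive when looking down at a nearby toy model.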
14.3.3 Changing the percept of a space: top-down influences
Distance estimation, like time estimation, is modulated by individuals’ subjective perception: buying two gallons of milk instead of a carton of cream to carry back from the grocery store can make you feel that you are farther from home. Interestingly, work by Proffitt and collaborators (Proffitt et al., 2003) shows that nonoptical cognitive variables may influence the perception of space cues. Along with task constraints, the physical resources and capabilities available to an agent change the perception of space (e.g., distance and slant angle of ground surfaces). For example, participants wearing heavy backpacks thought that a target object on the ground was located further away from the starting point than did individuals who were not wearing backpacks (Proffitt et al., 2003). Importantly, such modulation of distance estimation occurred only when the participants intended to walk the distance (Witt et al., 2004). Similarly, the manipulation with a backpack load had no effect on people’s distance estimation when they were asked to throw a ball in the direction of the object. However, the weight of the ball changed the estimation of distance for participants who intended to throw the ball. In other words, only when the increased physical effort was directly related to the intended action did the estimation of the distance change (Witt et al., 2004; but see also Woods et al., 2009). Other studies have shown that the inherent characteristics or physical capabilities of an individual can also influence how they perceive space. For example, compared with younger people, older people with low physical capabilities tend to estimate the same distance as longer or the same hill as steeper (Bhalla and Proffitt, 1999). When younger participants are primed with an elderly stereotype, they also have a tendency to overestimate distances (Twedt et al., 2009). The psychosocial state of an individual might also influence the perception of the space. Proffitt and colleagues found that participants who imagined positive social contact estimated the slant of a hill to be less than did participants who imagined neutral or negative social contact (Schnall et al., 2008). Although it is hard to conclude from these studies whether these nonoptical factors fundamentally changed the observers’ perceptions or whether they modulated responses without changing perception, they provide evidence that the experience of the geographical properties of a space can be influenced by changes in the psychological load of an observer, beyond the attributes of the physical world.
14.3.4 Adaptation to spatial layout
The previous sections have presented examples of how spatial low-level cues and preexisting top-down knowledge can influence our perception of the space that we are looking at, even if our physical view of the world stays the same. Similarly, temporal history can also influence the perception of a space: the experience that you had with particular visual scenes just moments ago can change your perception of the structure and depth of a scene that you are currently viewing. This is the phenomenon of adaptation: if observers are overexposed to certain visual features, adaptation to those features affects the conscious perception of a subsequently presented stimulus (e.g., this is classically demonstrated by adapting to a grating moving in one direction, where, afterwards, a static grating will appear to move in the opposite direction). Using an adaptation paradigm, Greene and Oliva (2010) tested whether observers adapt to global spatial-envelope properties (described in Section 14.2.2), such as mean depth and openness. In one study, observers were presented with a stream of natural scenes which were largely different (in terms of categories, colors, layout, etc.), but which were all exemplars of very open scenes, representing vista space (panorama views of fields, coastlines, deserts, beaches, mountains, etc.). Following this adaptation phase, a scene picture with a medium level of openness (e.g., a landscape with a background element) was presented for a short duration, and observers had to quickly decide whether this scene was very open or very closed. When observers were adapted to a stream of open scenes, ambiguous test images were more likely to be judged as closed. In contrast, the same ambiguous test images were judged to be open following adaptation to a stream of closed scenes (e.g., caves, forests, or canyons). Similar aftereffects occurred after observers had adapted to other extremes of spatial-envelope properties, such as small versus large depth and natural versus urban spaces, and even to higher-level properties of the scene, such as when the view depicted an environment with a hot versus a cold climate. Importantly, Greene and Oliva showed further that adaptation to different scene envelope properties not only influenced judgments of the corresponding scene properties for a new image but also influenced categorical judgments about a new image. This experiment took advantage of the fact that fields are usually open scenes, while forests are typically closed scenes. Importantly, there is a continuum between field and forest scenes, with some scenes existing ambiguously between the two categories that can be perceived either as a field or as a forest (see Figure 14.7). During the adaptation phase, observers were presented with a stream of images which again varied in their basic-level category and their surface
Figure 14.7 Continuum between forests and fields. Images in the middle of the continuum have an ambiguous category and can be perceived either as a field or as a forest. A color version of this figure can be found on the publisher’s website (www.cambridge.org/978107001756).
features, but all depicted open or closed views. No forests or fields were presented in this stream of images. After the observers had adapted to open scenes, an ambiguously open or closed image would be expected to appear to be more closed. The critical question was whether an ambiguous field/forest image would also be more likely to be judged as a forest, the more enclosed category, than as a field. Similarly, adapting to a stream of closed natural images should cause the same ambiguous field/forest to be more likely to be judged as a field. Indeed, this is exactly what Greene and Oliva observed. These results demonstrate that exposure to a variety of scenes with a shared spatial property can influence the observers’ judgments of that spatial property later, and can even influence the semantic categorization of a scene. Such adaptation aftereffects have been shown for low-level features such as orientation and motion (Wade and Verstraten, 2005), and even for high-level features such as shape, face identity, and gender (Leopold et al., 2001; Webster et al., 2004). The adaptation mechanisms suggest that the neural system is tracking the statistics of the visual input and tuning its response properties to match. Thus, the aftereffects for global scene properties broadly imply that as observers process natural scenes, one of the extracted statistics to which the system is tuned reflects the layout and perceived volume of the scene.
14.4 Remembering the shape of a space
The previous section has reviewed evidence that the perception of space can be manipulated by low-level image cues, top-down influences, and the temporal history of scenes. These perceptual illusions occur online while the relevant sensory information is present in the world, but similar systematic distortions of space occur when we represent scene information that is no longer in view but is instead held in memory. In the following sections, we discuss how
Figure 14.8 Example of boundary extension. After viewing a close-up view of a scene (a), observers tend to report an extended representation (b). A color version of this figure can be found on the publisher’s website (www.cambridge.org/ 978107001756).
single views are remembered and how this effect might be understood in the framework of navigation through a space.
14.4.1 Behavioral and neural aspects of boundary extension
When presented with a scene view, what do observers remember about the space depicted? Intraub and Richardson (1989) presented observers with pictures of scenes, and found that when the observers drew the scenes from memory, they systematically drew more of the space than was actually shown: this is the phenomenon of boundary extension. Since this initial demonstration, much research has been done showing the generality of this effect. For example, boundary extension is robust to various tasks beyond drawing, such as rating and border adjustment (e.g., Intraub et al., 1992, 2006), and to different image sets (Candel et al., 2003; Daniels and Intraub, 2006); it operates over a range of timescales from minutes to hours (Intraub and Dickinson, 2008); and it is found in both young children and older adults (Candel et al., 2004; Seamon et al., 2002). Interestingly, boundary extension occurs even when observers are blindfolded – they explore space with their hands – suggesting an important link between the representations of space across sensory modalities (Intraub, 2004). Figure 14.8 shows an example of boundary extension. Observers presented with the scene in Figure 14.8a will remember the scene as having more information around the edges, as depicted in Figure 14.8b. In a functional neuroimaging study, Park et al. (2007) examined whether scene-selective neural regions showed evidence of representing more space than the original scene view. Critically, they used a neural adaptation paradigm (also
called repetition attenuation; Grill-Spector et al., 2005) to determine what scene information was being represented. In an adaptation paradigm, when a stimulus is repeated, the amount of neural activity is reduced when processed for the second time compared with when it was processed as a novel stimulus. This logic suggests that a second presentation of the stimulus matches what was previously presented, thereby facilitating visual processing and reducing neural activity. Park and colleagues used the phenomenon of neural adaptation to examine whether the brain’s sensitivity to scene views was consistent with predictions derived from the phenomenon of boundary extension. When an observer is presented with a close scene view, the existence of boundary extension predicts that this scene view might be represented at a wider angle than that at which it was originally presented. Thus, if the second stimulus is presented slightly wider than the original, this should match the representation in scene-selective areas and show a large degree of attenuation. Conversely, if the order of these stimuli is reversed, the representation of the wide-angle view will be very different from that of a subsequently presented close view, and thus no neural attenuation is expected. This is precisely the pattern of results that Park et al. (2007) observed in the parahippocampal place area, as shown in Figure 14.9.
14.4.2 Navigating to remembered scene views
While boundary extension can be interpreted as an extrapolation of information in the periphery of a scene (requiring no movement of the observer), this effect can also be examined within a three-dimensional environment. Here we explore the notion of a prototypical view and examine whether the memory of a view from a specific location might be influenced by a prototypical view. In general, a view of a scene arises from an observer’s location in a three-dimensional space. As the observer walks through an environment, the view gives rise to a scene gist (e.g., a forest) that changes slowly as the observer walks forward (e.g., a house view, followed by a view of a foyer, then a corridor, and then a bedroom). In other words, different views may take on new semantic interpretations at different spatial scales (Oliva and Torralba, 2001; Torralba and Oliva, 2002) (Figure 14.10). However, there are also many views with the same scene gist (e.g., a bedroom), which remain consistent whether the observer walks a few steps backwards or a few steps forward. Given these different views of a scene, is there a prototypical location within a volume that gives rise to a consistently preferred scene view? Konkle and Oliva (2007a) examined this question by placing observers at either the front or the back of a “virtual room” and having them maneuver forward
Figure 14.9 Examples of close–wide and wide–close conditions are presented in the top row. The peaks of the hemodynamic responses for close–wide and wide–close conditions are shown for the PPA and LOC. An interaction between the activations for the close–wide and wide–close conditions representing boundary extension asymmetry was observed in the PPA but not in the LOC. The error bars indicate the standard error (± s.e.m.). Figure adapted from Park et al. (2007). A color version of this figure can be found on the publisher’s website (www.cambridge.org/ 978107001756).
Figure 14.10 The semantic meaning of a scene changes as the depth of the scene increases. From a large distance, an observer may view buildings, which as the observer approaches change to rooms or to singleton objects on a surface. Figure adapted from Torralba and Oliva (2002).
or backward through the space until they had the best view. We did not define what the “best view” was for observers, but provided people with instructions reminiscent of the story of Goldilocks and the three bears: “this very close view is too close, and this very far view is too far, so somewhere in between is a view that is just right.” In all our rooms, the three-dimensional space was constructed so that all locations and views had the same semantic gist of the scene. Two places are shown in Figure 14.11, which shows the closest possible view (left), the farthest possible view (right), and the preferred view across all observers (middle). Despite the subjectivity of the task, the observers were relatively consistent in their preferred views, and most used a consistent navigation strategy in which they moved all the way to the back of the scene, for example “I zoomed out to see what type of space it was,” and then walked forward “until I felt comfortable / until it looked right.” A few observers commented that to get the best view they wanted to step either left or right, which was not allowed in the experimental design. The data suggest that, given a scene, some views are indeed better than others, and observers have a sense of how to walk to get the best view. In geometric terms, this notion that there is a prototypical view implies that there is a particular preferred viewing location in 3D space. Konkle and Oliva (2007a) next tested memory for scene views along the walking path (from the entrance view to the close-up view). Observers studied particular scene views for each of the rooms, where some views were close up and others were wide angle, defined relative to the preferred view. To test memory for these scene views, the observers were placed in the room at either the back or the front of the space and had to maneuver through the space
Figure 14.11 Two example spaces, a kitchen (top) and a living room (bottom). The closest and farthest views of the scenes are shown (left and right, respectively), as well as the preferred scene view across observers (middle). A color version of this figure can be found on the publisher’s website (www.cambridge.org/978107001756).
to match where they stood during the study phase (for a similar method, see Konkle and Oliva, 2007b). The results showed that, for the close-up views, observers tended to navigate to a position farther back in the scene, showing boundary extension. For far views, the opposite pattern was observed, where people tended to navigate to a closer location than the view studied (Figure 14.12). Thus, in this experimental task with these scenes, we observed boundary extension for close views and boundary restriction for far views. Importantly, the memory errors were not driven by a few large errors (e.g., as if observers sometimes selected a very far scene rather than a very close one), but instead reflect small shifts of one or two virtual steps. While boundary restriction is not often observed, one possible explanation for why we observed both boundary extension and boundary restriction is that the close and far views used here cover a large range of space (the “action space”; see Section 14.3.1). These systematic biases in memory can be explained by the notion that memory for a scene is reconstructive (e.g., Bartlett, 1932). According to this idea, when an observer has to navigate to match a scene view in memory, if they have any uncertainty about the location, then they will not guess randomly from among the options, but instead will choose a view that is closer to the
Figure 14.12 (a) Memory errors for scenes presented too close or too far, measured by number of steps. The error bars reflect ±1 s.e.m. (b) Example scene used in the memory experiment. Observers were presented with either a too-close or a too-far view. The remembered scene views were systematically biased towards the prototypical view. A color version of this figure can be found on the publisher’s website (www.cambridge.org/978107001756).
prototypical view. This way, the memory for a particular scene representation can take advantage of the regularities observed in other scenes of that semantic category and spatial layout to support a more robust memory trace. While this strategy will lead to small systematic memory errors towards the prototypical view, it is actually an optimal strategy to improve memory accuracy overall (Huttenlocher et al., 2000; Hemmer and Steyvers, 2009). Currently, there is still much to understand about what aspects of natural visual scenes determine the magnitude and direction of memory errors. For example, some work suggests that boundary extension errors can depend on the identity and size of the central object (e.g., Bertamini et al., 2005) and on the complexity of the background scene (e.g., Gallagher et al., 2005). The relative contributions of structural features of the space and semantic features of the scene in these effects are unknown, and we cannot yet take an arbitrary scene and understand what spatial distortions will be present in memory. We believe that such predictions will become possible as we gain a richer and more quantitative vocabulary that characterizes the many possible spatial relationships between an observer and the elements in front of him or her, as well as the spatial structure of the three-dimensional space. Finally, in all of these experiments, observers have to remember a scene view that is presented at a visual angle subtending 5° to 20°. However, in natural viewing conditions, observers navigate through the environment with a full-field
Figure 14.13 Scene view presented in a full-field display. Observers were seated at the table, with their head position fixed by the chin rest. A color version of this figure can be found on the publisher’s website (www.cambridge.org/978107001756).
view. While previous studies have demonstrated that the shape of the aperture (rectilinear, oval, or irregular) does not affect the magnitude of boundary extension (Daniels and Intraub, 2006), one important question is whether these memory biases necessarily depend on a restricted view of a scene relative to our whole visual field. To test this, we had observers complete the same task with a full-field display (Figure 14.13; T. Konkle, M. Boucart, and A. Oliva, unpublished data). We found that the observers showed similar memory errors, with boundary extension in memory for close-up views, boundary compression in memory for far-away views, and no systematic bias for prototypical scene views. The data from this panoramic study demonstrate that these scene memory mechanisms discovered using pictorial scenes presented on a monitor operate even on scenes presented to the full visual field. Overall, these data support the notion that prototypical views may serve as an anchor for the memory of a specific view, and that scene-processing mechanisms may serve not only to help construct a continuous world, but also to support optimal views for perception and memory of a three-dimensional space.
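The reconstructive account of these memory errors can be made concrete with a toy simulation in the spirit of category-adjustment models (Huttenlocher et al., 2000; Hemmer and Steyvers, 2009). This is not the authors' model, and all the numbers are illustrative: a noisy memory trace of the studied viewing distance is combined with the prototypical view, each weighted by its reliability. The result is a small systematic bias towards the prototype together with a lower overall error than trusting the noisy trace alone.

```python
# Toy reliability-weighted memory: noisy trace + prototype (illustrative values only).
import numpy as np

rng = np.random.default_rng(0)

prototype = 0.0        # prototypical viewing distance (in "virtual steps")
studied = -3.0         # a too-close studied view, three steps in front of the prototype
trace_sd = 2.0         # noise in the memory trace
prior_sd = 4.0         # spread of studied views around the prototype

# Least-squares optimal weight on the trace, given the two reliabilities.
w = prior_sd ** 2 / (prior_sd ** 2 + trace_sd ** 2)

traces = studied + rng.normal(0.0, trace_sd, size=100_000)
reconstructed = w * traces + (1.0 - w) * prototype      # biased towards the prototype

print(f"mean remembered view: {reconstructed.mean():+.2f} (studied at {studied:+.2f})")
print(f"squared error, trace alone: {np.mean((traces - studied) ** 2):.2f}")
print(f"squared error, reconstructed: {np.mean((reconstructed - studied) ** 2):.2f}")
```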
14.5 From views to volume: integrating space
People experience space in a variety of ways, sometimes viewing the scene through an aperture but sometimes becoming immersed in an environment that extends beyond what can be perceived in a single view. Numerous
Figure 14.14 The first, second, and third views from a single panoramic scene. Views 1, 2, and 3 were sequentially presented one at a time at fixation. The first and the third view overlapped in 33% of their physical details. A color version of this figure can be found on the publisher’s website (www.cambridge.org/978107001756).
studies have shown that the brain makes predictions about what may exist in the world beyond the aperture-like visual input by using visual associations or context (Bar, 2004; Chun, 2000; Palmer, 1975, among others), by combining the current scene with recent experience in perceptual or short-term memory (Irwin et al., 1990; Lyle and Johnson, 2006; Miller and Gazzaniga, 1998; Oliva et al., 2004), and by extrapolating scene boundaries (Intraub and Richardson, 1989; Hochberg, 1986) (see Section 14.4). These predictions and extrapolations help build a coherent percept of the world (Hochberg, 1978, 1986; Kanizsa and Gerbino, 1982). Park and Chun (2009) tested whether the brain holds an explicit neural representation of a place beyond the scene in view. In an fMRI scanner, participants were presented with three consecutive, overlapping views from a single panoramic scene (Figure 14.14), so that the observers perceived a natural scan of the environment, as if moving their head from left to right. The researchers investigated whether brain regions known to respond preferentially to pictures of natural scenes and spaces also show sensitivity to views that are integrated into a coherent panorama. Park and Chun (2009) found that the parahippocampal place area (PPA), an area known to represent scenes and spatial-layout properties (Epstein and Kanwisher, 1998; Park et al., 2011), has a view-specific representation (see also Epstein et al., 2003): the PPA treated each view of the panoramic scene as a different “scene.” In contrast, the retrosplenial cortex (RSC), an area implicated in navigation and route learning in humans and rodents (Burgess et al., 2001; Aguirre and D’Esposito, 1999; see also Vann et al., 2009 for a review) exhibited view-invariant representation: the RSC treated all three different
views as a single continuous place, as expressed by neural attenuation from view 1 to view 3. Additional experiments suggested that the RSC showed such neural attenuation only when the views were displayed in close spatiotemporal continuity. When the same trials were presented with a longer lag or with intervening items between views, the RSC no longer showed neural attenuation and responded highly to each view as if it was a novel scene. In summary, the PPA and RSC appear to complement each other by representing both view-specific and view-invariant information from scenes in a place. While Park and Chun (2009) tested the extrapolation of views at a local level (e.g., by scanning the world through simulated head and eye movements while the viewer’s location was constant), Epstein et al. (2007) tested the neural basis of the extrapolation of views to a larger volume, beyond the viewer’s current location. In their study, Epstein et al. presented participants from the University of Pennsylvania community with views of familiar places around the campus or views from a different, unfamiliar campus. The participants’ tasks were to judge the location of each view (e.g., whether on the west or east of 36th Street) or its orientation (e.g., whether it was facing to the west or east of the campus). Whereas the PPA responded equally to all conditions, the RSC activation was strongest for location judgments. This task required information about the viewer’s current location, as well as the location of the current scene within the larger environment. The RSC activation was second highest when viewers were making orientation judgments, which required information about the viewer’s location and head direction, but not the location of the current scene relative to the environment. The RSC responded less highly in the familiar condition and the least in the unfamiliar condition. These graded modulations of RSC activity suggest that this region is strongly involved in the retrieval of long-term spatial knowledge, such as the location of a viewer within a scene, and the location of a scene within a bigger environment. The involvement of the RSC in the retrieval of long-term memory is consistent with patient and neuroimaging studies that have shown the involvement of the RSC in episodic and autobiographic memory retrieval (Burgess et al., 2001; Maguire, 2001; Byrne et al., 2007). In a related vein, several spatial navigation studies suggest that people can use geometric environmental cues such as landmarks (Burgess, 2006; McNamara et al., 2003) or alignments with respect to walls to recognize a novel view of the same place as fast as a learned view, suggesting that people represent places or environments beyond the visual input. Interestingly, the modern world is full of spatial leaps and categorical continuity ruptures between scenes that violate the expectations we have about the geometrical relationship between
places within a given environment. For instance, subways act like “wormholes” (Rothman and Warren, 2006; Schnapp and Warren, 2007; Ericson and Warren, 2009), distorting the perception of the spatiotemporal relationships between the locations of places in a geometrical map. Warren and colleagues tested how people behave in such “rips” and “folds” using a maze in a virtual reality world. When participants were asked to walk between two objects at different locations in a maze, they naturally took advantage of wormhole shortcuts and avoided going around a longer path. The observers did not notice that the wormholes violated the Euclidean structure of the geometrical map of the maze. These results demonstrate that the spatial knowledge about a broad environment does not exist as a complete integrated cognitive map per se, but instead exists as a combination of local neighborhood directions and distances embedded in a weak topological structure of the world. Altogether, human spatial perception is not restricted to the current view of an aperture, but expands to the broader environmental space by representing multiple continuous views as a single integrated place, and linking the current view with long-term spatial knowledge. At the neural level, the PPA and RSC facilitate a coherent perception of the world, with the PPA representing the specific local geometry of the space and the RSC integrating multiple snapshots of views using spatiotemporal continuity and long-term memory.
14.6 Conclusions
Perceiving the geometry of space in our three-dimensional world is essential for navigating and interacting with objects. In this chapter, we have offered a review of key work in the behavioral, computational, and cognitive-neuroscience domains that has formalized space as an entity on its own. Space itself can be considered an “object of study,” whose fundamental structure is composed of structural and semantic properties. We have shown that perception of the shape of a space is modulated by low-level image cues, top-down influences, stored knowledge, and spatial and temporal history. Like an object, a space has a function, a purpose, a typical view, and a geometrical shape. The shape of a space is an entity that, like the shape of an object or a face, can be described by its contours and surface properties. Furthermore, the perception of space is sensitive to task constraints and experience and subject to visual illusions and distortions in short-term and long-term memory. Lastly, evidence suggests that dedicated neural substrates encode the shape of space. Although the notion of studying space’s “shape” may seem unorthodox, consider that, as moving agents, what we learn about the world occurs within a structured geometric volume of space.
Acknowledgments
The authors wish to thank Barbara Hidalgo-Sotelo and Madison Capps for insightful comments on the manuscript. This work was funded by National Science Foundation grants 0705677 and 0546262 to A. O. The author T. K. was funded by a National Science Foundation Graduate Research fellowship. All of the authors contributed equally to this chapter.
References
Aguirre, G. K. and D’Esposito, M. (1999). Topographical disorientation: a synthesis and taxonomy. Brain, 122: 1613–1628.
Bar, M. (2004). Visual objects in context. Nature Rev. Neurosci., 5: 617–629.
Barnard, K. and Forsyth, D. A. (2001). Learning the semantics of words and pictures. In Proceedings of the International Conference on Computer Vision, Vancouver, Vol. II, pp. 408–415.
Bartlett, F. C. (1932). Remembering: A Study in Experimental and Social Psychology. Cambridge: Cambridge University Press.
Benedikt, M. L. (1979). To take hold of space: isovists and isovist fields. Environ. Plan. B, 6: 47–65.
Benedikt, M. L. and Burnham, C. A. (1985). Perceiving architectural space: from optic arrays to isovists. In W. H. Warren and R. E. Shaw (eds.), Persistence and Change, pp. 103–114. Hillsdale, NJ: Lawrence Erlbaum.
Bertamini, M., Jones, L., Spooner, A., and Hecht, H. (2005). Boundary extension: the role of magnification, object size, context and binocular information. J. Exp. Psychol.: Hum. Percept. Perf., 31: 1288–1307.
Bhalla, M. and Proffitt, D. R. (1999). Visual-motor recalibration in geographical slant perception. J. Exp. Psychol.: Hum. Percept. Perf., 25: 1076–1096.
Burgess, N. (2006). Spatial memory: how egocentric and allocentric combine. Trends Cogn. Sci., 10: 551–557.
Burgess, N., Becker, S., King, J. A., and O’Keefe, J. (2001). Memory for events and their spatial context: models and experiments. Phil. Trans. R. Soc. Lond. B, 356: 1493–1503.
Byrne, P., Becker, S., and Burgess, N. (2007). Remembering the past and imagining the future: a neural model of spatial memory and imagery. Psychol. Rev., 114: 340–375.
Candel, I., Merckelbach, H., and Zandbergen, M. (2003). Boundary distortions for neutral and emotional pictures. Psychonom. Bull. Rev., 10: 691–695.
Candel, I., Merckelbach, H., Houben, K., and Vandyck, I. (2004). How children remember neutral and emotional pictures: boundary extension in children’s scene memories. Am. J. Psychol., 117: 249–257.
Carson, C., Belongie, S., Greenspan, H., and Malik, J. (2002). Blobworld: image segmentation using expectation-maximization and its application to image querying. IEEE Trans. Pat. Anal. Machine Intell., 24: 1026–1038.
Chun, M. M. (2000). Contextual cueing of visual attention. Trends Cogn. Sci., 4: 170–178.
Creem-Regehr, S. H., Gooch, A. A., Sahm, C. S., and Thompson, W. B. (2004). Perceiving virtual geographical slant: action influences perception. J. Exp. Psychol.: Hum. Percept. Perf., 30: 811–821.
Cutting, J. E. (2003). Reconceiving perceptual space. In H. Hecht, M. Atherton, and R. Schwartz (eds.), Perceiving Pictures: An Interdisciplinary Approach to Pictorial Space, pp. 215–238. Boston, MA: MIT Press.
Cutting, J. E. and Vishton, P. M. (1995). Perceiving layout and knowing distances: the integration, relative potency, and contextual use of different information about depth. In W. Epstein and S. Rogers (eds.), Perception of Space and Motion, pp. 69–117, Vol. 5 of Handbook of Perception and Cognition. San Diego, CA: Academic Press.
Daniels, K. K. and Intraub, H. (2006). The shape of a view: are rectilinear views necessary to elicit boundary extension? Vis. Cogn., 14: 129–149.
Davis, L. S. and Benedikt, M. L. (1979). Computational Model of Space: Isovists and Isovists Fields. Technical Report, School of Architecture, the University of Texas at Austin.
Epstein, R. A. (2008). Parahippocampal and retrosplenial contributions to human spatial navigation. Trends Cogn. Sci., 12: 388–396.
Epstein, R. and Kanwisher, N. (1998). A cortical representation of the local visual environment. Nature, 392: 598–601.
Epstein, R., Graham, K. S., and Downing, P. E. (2003). Viewpoint-specific scene representations in human parahippocampal cortex. Neuron, 37: 865–876.
Epstein, R. A., Parker, W. E., and Feiler, A. M. (2007). Where am I now? Distinct roles for parahippocampal and retrosplenial cortices in place recognition. J. Neurosci., 27: 6141–6149.
Ericson, J. and Warren, W. (2009). Rips and folds in virtual space: ordinal violations in human spatial knowledge. [Abstract.] J. Vis., 9(8): 1143.
Fortenbaugh, F. C., Hicks, J. C., Hao, L., and Turano, K. A. (2007). Losing sight of the bigger picture: peripheral field loss compresses representations of space. Vis. Res., 47: 2506–2520.
Friedman, A. (1979). Framing pictures: the role of knowledge in automatized encoding and memory for gist. J. Exp. Psychol.: Gen., 108: 316–355.
Gallagher, K., Balas, B., Matheny, J., and Sinha, P. (2005). The effects of scene category and content on boundary extension. In B. Bara, L. Barsalou, and M. Bucciarelli (eds.), Proceedings of the 27th Annual Meeting of the Cognitive Science Society. Stresa, Italy: Cognitive Science Society.
Greene, M. R. and Oliva, A. (2010). High-level aftereffects to global scene property. J. Exp. Psychol.: Hum. Percept. Perf., 36(6): 1430–1442.
Grill-Spector, K., Henson, R., and Martin, A. (2005). Repetition and the brain: neural models of stimulus-specific effects. Trends Cogn. Sci., 10: 14–23.
Held, R. and Banks, M. (2008). Perceived size is affected by blur and accommodation. [Abstract.] J. Vis., 8(6): 442.
Hemmer, P. and Steyvers, M. (2009). A Bayesian account of reconstructive memory. Topics Cogn. Sci., 1: 189–202.
Higashiyama, A. (1996). Horizontal and vertical distance perception: the discorded-orientation theory. Percept. Psychophys., 58: 259–270.
Hochberg, J. (1978). Perception, 2nd edn. New York: Prentice-Hall.
Hochberg, J. (1986). Representation of motion and space in video and cinematic displays. In K. J. Boff, L. Kaufman, and J. P. Thomas (eds.), Handbook of Perception and Human Performance, Vol. 1, pp. 22:1–22:64. New York: Wiley.
Hollingworth, A. and Henderson, J. M. (2004). Sustained change blindness to incremental scene rotation: a dissociation between explicit change detection and visual memory. Percept. Psychophys., 66: 800–807.
Huttenlocher, J., Hedges, L. V., and Vevea, J. L. (2000). Why do categories affect stimulus judgment? J. Exp. Psychol.: Gen., 129: 220–241.
Indow, T. (1991). A critical review of Luneburg’s model with regard to global structure of visual space. Psychol. Rev., 98: 430–453.
Intraub, H. (2004). Anticipatory spatial representation in a deaf and blind observer. Cognition, 94: 19–37.
Intraub, H. and Dickinson, C. A. (2008). False memory 1/20th of a second later: what the early onset of boundary extension reveals about perception. Psychol. Sci., 19: 1007–1014.
Intraub, H. and Richardson, M. (1989). Wide-angle memories of close-up scenes. J. Exp. Psychol.: Learn., Mem. Cogn., 15: 179–187.
Intraub, H., Bender, R. S., and Mangels, J. A. (1992). Looking at pictures but remembering scenes. J. Exp. Psychol.: Learn., Mem. Cogn., 18: 180–191.
Intraub, H., Hoffman, J. E., Wetherhold, C. J., and Stoehs, S. (2006). More than meets the eye: the effect of planned fixations on scene representation. Percept. Psychophys., 5: 759–769.
Irwin, D. E., Zacks, J. L., and Brown, J. S. (1990). Visual memory and the perception of a stable visual environment. Percept. Psychophys., 47: 35–46.
Kanizsa, G. and Gerbino, W. (1982). Amodal completion: Seeing or thinking? In J. Beck (ed.), Organization and Representation in Perception, pp. 167–190. Hillsdale, NJ: Erlbaum.
Koenderink, J. J., van Doorn, A. J., and Todd, J. T. (2009). Wide distribution of external local sign in the normal population. Psychol. Res., 73: 14–22.
Konkle, T. and Oliva, A. (2007a). Normative representation of objects and scenes: evidence from predictable biases in visual perception and memory. [Abstract.] J. Vis., 7(9): 1049.
Konkle, T. and Oliva, A. (2007b). Normative representation of objects: evidence for an ecological bias in perception and memory. In D. S. McNamara and J. G. Trafton (eds.), Proceedings of the 29th Annual Cognitive Science Society, pp. 407–413. Austin, TX: Cognitive Science Society.
Lappin, J. S., Shelton, A. L., and Rieser, J. J. (2006). Environmental context influences visually perceived distance. Percept. Psychophys., 68: 571–581.
Leopold, D., O’Toole, A., Vetter, T., and Blanz, V. (2001). Prototype-referenced shape encoding revealed by high-level aftereffects. Nature Neurosci., 4: 89–94.
Loomis, J. and Philbeck, J. (1999). Is the anisotropy of perceived 3D shape invariant across scale? Percept. Psychophys., 61: 397–402.
Loomis, J. M., Da Silva, A., Fujita, N., and Fukusima, S. S. (1992). Visual space perception and visual directed action. J. Exp. Psychol.: Hum. Percept. Perf., 18: 906–921.
Loomis, J. M., Da Silva, J. A., Philbeck, J. W., and Fukushima, S. S. (1996). Visual perception of location and distance. Curr. Dir. Psychol. Sci., 5: 72–77.
Lyle, K. B. and Johnson, M. K. (2006). Importing perceived features into false memories. Memory, 14: 197–213.
Maguire, E. A. (2001). The retrosplenial contribution to human navigation: a review of lesion and neuroimaging findings. Scand. J. Psychol., 42: 225–238.
Marr, D. (1982). Vision. San Francisco, CA: W. H. Freeman.
McNamara, T. P., Rump, B., and Werner, S. (2003). Egocentric and geocentric frames of reference in memory of large-scale space. Psychonom. Bull. Rev., 10: 589–595.
Michel, L. (1996). Light the Shape of Space. Designing with Space and Light. New York: Van Nostrand Reinhold.
Miller, M. B. and Gazzaniga, M. S. (1998). Creating false memories for visual scenes. Neuropsychologia, 46: 513–520.
Murray, S. O., Boyaci, H., and Kersten, D. (2006). The representation of perceived angular size in human primary visual cortex. Nature Neurosci., 9: 429–434.
O’Keefe, J. A. (1979). A review of hippocampal place cells. Prog. Neurobiol., 13: 419–439.
Oliva, A. (2005). Gist of the scene. In L. Itti, G. Rees, and J. K. Tsotsos (eds.), The Encyclopedia of Neurobiology of Attention, pp. 251–256. San Diego, CA: Elsevier.
Oliva, A. and Torralba, A. (2001). Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comp. Vis., 42: 145–175.
Oliva, A. and Torralba, A. (2002). Scene-centered description from spatial envelope properties. In H. H. Bülthoff, S.-W. Lee, T. Poggio, and C. Wallraven (eds.), Biologically Motivated Computer Vision, pp. 263–272. Lecture Notes in Computer Science, 2525. Berlin: Springer.
Oliva, A. and Torralba, A. (2006). Building the gist of a scene: the role of global image features in recognition. Prog. Brain Res.: Vis. Percept., 155: 23–36.
Oliva, A. and Torralba, A. (2007). The role of context in object recognition. Trends Cogn. Sci., 11: 520–527.
Oliva, A., Arsenio, H. C., and Wolfe, J. M. (2004). Panoramic search: the interaction of memory and vision in search through a familiar scene. J. Exp. Psychol.: Hum. Percept. Perf., 30: 1132–1146.
Palmer, S. E. (1975). The effects of contextual scenes on the identification of objects. Mem. Cogn., 3: 519–526.
Park, S. and Chun, M. M. (2009). Different roles of the parahippocampal place area (PPA) and retrosplenial cortex (RSC) in panoramic scene perception. Neuroimage, 47: 1747–1756.
337
338
A. Oliva, S. Park, and T. Konkle Park, S., Intraub, H., Widders, D., Yi, D. J., and Chun, M. M. (2007). Beyond the edges of a view: boundary extension in human scene-selective visual cortex. Neuron, 54: 335–342. Park, S., Brady, T. F., Greene, M. R., and Oliva, A. (2011). Disentangling scene content from spatial boundary: complementary roles for the parahippocampal place area and lateral occipital complex in representing real-world scenes. J. Neurosci., 31(4): 1333–1340. Pirenne, M. H. (1970). Optics, Painting and Photography. Cambridge: Cambridge University Press. Potter, M. C. (1976). Short-term conceptual memory for pictures. J. Exp. Psychol.: Hum. Learn. Mem., 2: 509–522. Proffitt, D. R., Bhalla, M., Gossweiler, R., and Midgett, J. (1995). Perceiving geographical slant. Psychonom. Bull. Rev., 2: 409–428. Proffitt, D. R., Stefanucci, J., Banton, T., and Epstein, W. (2003). The role of effort in perceiving distance. Psychol. Sci., 14: 106–112. Psarra, S. and Grajewski, T. (2001). Describing shape and shape complexity using local properties. In J. Peponis, J. Wineman and S. Bafna (eds.), Proceedings of the 3rd International Space Syntax Symposium, pp. 28.1–28.16. Ann Arbor, MI: Alfred Taubman College of Architecture and Urban Planning, University of Michigan. Ross, M. G. and Oliva, A. (2010). Estimating perception of scene layout properties from global image features. J. Vis., 10(1): 2. Rothman, D. B. and Warren, W. H. (2006). Wormholes in virtual reality and the geometry of cognitive maps. [Abstract.] J. Vis., 6(6): 143. Schnall, S., Harber, K. D., Stefanucci, J. K., and Proffitt, D. R. (2008). Social support and the perception of geographical slant. J. Exp. Social Psychol., 44: 1246–1255. Schnapp, B. and Warren, W. (2007). Wormholes in virtual reality: what spatial knowledge is learned for navigation? [Abstract.] J. Vis., 7(9): 758. Seamon, J. G., Schlegel, S. E., Hiester, P. M., Landau, S. M., and Blumenthal, B. F. (2002). Misremembering pictured objects: people of all ages demonstrate the boundary extension illusion. Am. J. Psychol., 115: 151–167. Stamps, A. E. (2005). Enclosure and safety in urbanscapes. Environ. Behav., 37: 102–133. Tandy, C. R. V. (1967). The isovist method of landscape survey. In H. C. Murray (ed.), Symposium: Methods of Landscape Analysis, pp. 9–10. London: Landscape Research Group. Torralba, A. and Oliva, A. (2002). Depth estimation from image structure. IEEE Pattern Anal. Machine Intell., 24: 1226–1238. Torralba, A. and Oliva, A. (2003). Statistics of natural images categories. Network: Comput. Neural Syst., 14: 391–412. Turner, A., Doxa, M., O’Sullivan, D., and Penn, A. (2001). From isovists to visibility graphs: a methodology for the analysis of architectural space. Environ. Plan. B: Plan. Des., 28: 103–121. Twedt, E., Hawkins, C. B., and Proffitt, D. (2009). Perspective-taking changes perceived spatial layout. [Abstract.] J. Vis., 9(8): 74.
Representing, perceiving, and remembering the shape of visual space Vann, S. D., Aggleton, J. P., and Maguire E. A. (2009). What does the retrosplenial cortex do? Nature Rev. Neurosci., 10: 792–802. Wade, N. J. and Verstraten, F. A. J. (2005). Accommodating the past: a selective history of adaptation. In C. Clifford and G. Rhodes (eds.), Fitting the Mind to the World: Adaptation and Aftereffects in High-Level Vision. Advances in Visual Cognition Series. Oxford: Oxford University Press. Wagner, M. (1985). The metric of visual space. Percept. Psychophys., 38: 483–495. Watt, S. J., Akeley, K., Ernst, M. O., and Banks, M. S. (2005). Focus cues affect perceived depth. J. Vis., 5: 834–862. Webster, M. A., Kaping, D., Mizokami, Y., and Duhamel, P. (2004). Adaptation to natural facial categories. Nature, 428: 557–561. Wiener, J. M. and Franz, G. (2005). Isovists as a means to predict spatial experience and behavior. In C. Freksa, M. Knauff, B. Krieg-Brückner, B. Nebel, and T. Barkowsky (eds.), International Conference on Spatial Cognition 2004, pp. 42–57. Lecture Notes in Computer Science, 3343. Berlin: Springer. Witt, J. K., Proffitt, D. R., and Epstein, W. (2004). Perceiving distance: a role of effort and intent. Perception, 33: 570–590. Woods, A. J., Philbeck, J. W., and Danoff, J. V. (2009). The various perceptions of distance: an alternative view of how effort affects distance judgments. J. Exp. Psychol.: Hum. Percept. Perf., 35: 1104–1117. Wu, B., Ooi, T. L., and Zijiang, J. H. (2004). Perceiving distance accurately by a directional process of integrating ground information. Nature, 428: 73–77. Yang, T. L., Dixon, M. W., and Proffitt, D. R. (1999). Seeing big things: overestimates of heights is greater for real objects than for objects in pictures. Perception, 28: 445–467.
339