Lecture Notes in Control and Information Sciences Editor: M. Thoma
237
Springer London Berlin Heidelberg New York Barcelona Budapest Hong Kong Milan Paris Santa Clara Singapore Tokyo
DavidJ. Kriegman,GregoryD. Hagerand A. StephenMorse(Eds)
The Confluence of Vision and Control
~
Springer
Series Advisory Board A. B e n s o u s s a n • M.J. G r i m b l e J.L. M a s s e y • Y.Z. T s y p k i n
• P. K o k o t o v i c
• H. K w a k e r n a a k
Editors D a v i d J. K r i e g m a n P h D G r e g o r y D. H a g e r P h D A. S t e p h e n M o r s e P h D D e p a r t m e n t s o f E l e c t r i c a l E n g i n e e r i n g a n d C o m p u t e r S c i e n c e , Yale U n i v e r s i t y P O B o x 2 0 8 2 6 7 Yale S t a t i o n , N e w H a v e n CT 0 6 5 2 0 - 8 2 6 7 , U S A
ISBN 1-85233-025-2 Springer-Verlag Berlin Heidelberg New York British Library Cataloguing in Publication Data The confluence of vision and control. - (Lecture notes in control and information sciences ; 237) 1.Robot vision 2.Robots - Control systems I.Kriegman, David J. II.Hager, Gregory D., 1961III.Morse, A. Stephen, 1939629.8'92 ISBN 1852330252 Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the Library of Congress Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms oflicences issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers. © Springer-Verlag London Limited 1998 Printed in Great Britain The use of registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use. The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made. Typesetting: Camera ready by editors Printed and bound at the Atheneum Press Ltd., Gateshead, Tyne & Wear 6913830-543210 Printed on acid-free paper
Preface The past decade has seen an increasing interest in the areas of real-time vision, visual tracking, active focus of attention in vision, vehicle navigation, and vision-based control of motion. Although these topics encompass a diverse set of problems spanning the fields of vision, control, robotics, and artificial intelligence, they all share a common focus: the application or processing of visual information in a way which entails the design and analysis of algorithms incorporating concepts studied in the field of control. This collection emerged from the Block Island Workshop on Vision and Control, held from June 23-27, 1997 on Block Island, Rhode Island. The workshop, organized by J. Malik and S. Sastry from the University of California Berkeley, and G. Hager, D. Kriegman, and A.S. Morse from Yale University, included participants from the U.S., Canada, Australia, France, Germany, Israel and Italy, and in the fields of computer vision, control theory, and robotics. It provided a forum for presenting new theoretical results, empirical investigations, and applications as well as an opportunity to discuss future directions for research. The contributions contained in this collection touch on many of the same topics, from foundational issues such as estimation, feedback, stability, delay, and task encoding, to the use of vision for control in the context of visual servoing and non-holonomic systems, the use of control within vision processes, and the application of vision and control in vehicle navigation, grasping, and micro-electro-mechanical systems (MEMS). We have also included a summary of the discussions which took place at the workshop. The Block Island Workshop on Vision and Control was generously supported by the National Science Foundation, the Army Research Office, and Yale University Faculty of Engineering.
Table of C o n t e n t s
Preface ....................................................... R e s e a r c h Issues in V i s i o n a n d C o n t r o l Gregory D. Hager, David J. Kriegman, and A. Stephen Morse . . . . . . . .
V
1
Visual H o m i n g : S u r f i n g o n t h e E p i p o l e s Ronen Basri, Ehud Rivlin, and Ilan Shimshoni . . . . . . . . . . . . . . . . . . . . . .
11
R o l e o f A c t i v e V i s i o n in O p t i m i z i n g V i s u a l F e e d b a c k for Robot Control Rajeev Shaxma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
24
A n A l t e r n a t i v e A p p r o a c h for I m a g e - P l a n e C o n t r o l o f R o b o t s Michael Seelinger, Steven B. Skaar, and Matthew Robinson . . . . . . . . . .
41
P o t e n t i a l P r o b l e m s o f S t a b i l i t y a n d C o n v e r g e n c e in Image-Based and Position-Based Visual Servoing Franqois Chaumette . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
66
What Can Be Done with an Uncalibrated Stereo System? Jo~o Hespanha, Zachary Dodds, Gregory D. Hager, and A.S. Morse . . .
79
V i s u a l T r a c k i n g o f P o i n t s as E s t i m a t i o n on t h e U n i t S p h e r e Alessandro Chiuso and Giorgio Picci . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
90
E x t e n d i n g Vis u al S e r v o i n g T e c h n i q u e s t o N o n h o l o n o m i c Mobile Robots Dimitris P. Tsakiris, Patrick Rives, and Claude Samson . . . . . . . . . . . . . .
106
A Lagrangian Formulation of Nonholonomic P a t h Following Ruggero Frezza, Giorgio Picci, and Stefano Soatto . . . . . . . . . . . . . . . . . .
118
V i s i o n G u i d e d N a v i g a t i o n for a N o n h o l o n o m i c M o b i l e R o b o t Yi Ma, Jana KovseckA, and Shankar Sastry . . . . . . . . . . . . . . . . . . . . . . . .
134
VIII
Table of Contents
Design, D e l a y a n d P e r f o r m a n c e in G a z e C o n t r o l : E n g i n e e r i n g a n d Biological A p p r o a c h e s Peter Corke . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
146
The Separation of P h o t o m e t r y and G e o m e t r y Via Active Vision Ruzena Bajcsy and Max Mintz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
159
Vision-Based System Identification and State Estimation William A. Wolovich and Mustafa Unel . . . . . . . . . . . . . . . . . . . . . . . . . . . .
171
V i s u a l T r a c k i n g , A c t i v e Vision, a n d G r a d i e n t Fl ow s Allen Tannenbaum and Anthony Yezzi, Jr . . . . . . . . . . . . . . . . . . . . . . . . . .
183
Visual Control of Grasping Billibon H. Yoshimi and Peter K. Allen . . . . . . . . . . . . . . . . . . . . . . . . . . . .
195
D y n a m i c Vis io n M e r g i n g C o n t r o l E n g i n e e r i n g a n d A I Methods Ernst D. Dickmanns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
210
R e a l - T i m e P o s e E s t i m a t i o n a n d C o n t r o l for C o n v o y i n g Applications R. L. Carceroni, C. Harman, C. K. Eveland, and C. M. Brown . . . . . . . .
230
V i s u a l R o u t i n e s for Vehicle C o n t r o l Garbis Salgian and Dana H. Ballard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
244
Microassembly of Micro-electro-mechanical Systems (MEMS) using Visual Servoing John Feddema and Ronald W. Simon . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
257
T h e B l o c k Island W o r k s h o p : S u m m a r y R e p o r t Gregory D. Hager, David J. Kriegman, and A. Stephen Morse with contributions from P. Allen, D. Forsyth, S. Hutchinson, J. Little, N. Harris McClamroch, A. Sanderson, and S. Skaar . . . . . . . . . . . . . . . . . . . . . . . . . . 273
List of C o n t r i b u t o r s
Peter K. A l l e n Department of Computer Science Columbia University New York, NY 10027 USA
[email protected] Ruzena Bajcsy GRASP Laboratory University of Pennsylvania Philadelphia, PA 19104 USA
[email protected] Dana H. Ballard Computer Science Department University of Rochester Rochester, NY 14627 USA
[email protected] R o n e n Basri Department of Applied Math The Weizmann Inst. of Science Rehovot 76100 Israel
[email protected] R o d r i g o L. Carceroni Department of Computer Science University of Rochester Rochester, NY 14627 USA
[email protected] Francois C h a u m e t t e IRISA / INRIA Rennes Campus de Beaulieu 35 042 Rennes cedex, France Francois.ChaumetteQirisa.fr
Alessandro Chiuso Dipartimento di Elettronica e Informatica Universit~ di Padova
Padua, Italy
[email protected] Christopher M. B r o w n Department of Computer Science University of Rochester Rochester, NY 14627 USA
[email protected] P e t e r Corke CSIRO Manufacturing Science and Technology PO Box 883 Kenmore, 4069 Australia
[email protected] Ernst D. D i c k m a n n s Universit~t der Bundeswehr Mfinchen D-85577 Neubiberg, Germany
[email protected] Zachary D o d d s Center for Computational Vision & Control Department of Computer Science Yale University New Haven, CT 06520-8285 USA
[email protected]
X
List of Contributors
Chris K. E v e l a n d Department of Computer Science University of Rochester Rochester, NY 14627 USA
[email protected] Dr. J o h n F e d d e m a Sandia National Laboratories PO Box 5800, MS 1003 Albuquerque, NM 87185 USA
[email protected] R u g g e r o Frezza Universit~ di Padova via Gradenigo 6a 35100 Padova - Italy frezzaQveronica.dei.unipd.it Gregory D. Hager Center for Computational Vision & Control Department of Computer Science Yale University New Haven, CT 06520-8285 USA hagerQcs.yale.edu C. H a r m a n Department of Computer Science University of Rochester Rochester, NY 14627 USA Jo~o Hespanha Center for Computational Vision Control Dept. of Electrical Engineering Yale University New Haven, CT 06520-8267 USA
[email protected] J a n a Ko~eck~ Electronics Research Laboratory University of California at Berkeley Berkeley, CA 94720 USA j
[email protected]
D a v i d J. K r i e g m a n Center for Computational Vision & Control Dept. of Electrical Engineering Yale University New Haven, CT 06520-8285 USA
[email protected] Yi M a Electronics Research Laboratory University of California at Berkeley Berkeley, CA 94720 USA
[email protected] M a x Mintz GRASP Laboratory University of Pennsylvania Philadelphia, PA 19104 USA
[email protected] A. S t e p h e n M o r s e Center for Computational Vision & Control Dept. of Electrical Engineering Yale University New Haven, CT 06520-8267 USA
[email protected] Giorgio Picci Dipartimento di Elettronica e Informatica Universit~ di Padova Padua, Italy
[email protected] Patrick R i v e s INRIA Sophia-Antipolis 2004, Route des Lucioles, B.P. 93 06902, Sophia Antipolis Cedex France
[email protected]
List of Contributors Ehud Rivlin Department of Computer Science The Technion Haifa 32000 Israel
[email protected] Matthew Robinson Department of Aerospace and Mechanical Engineering Fitzpatrick Hall of Engineering University of Notre Dame Notre Dame, Indiana 46556-5637
[email protected] Garbis Salgian Computer Science Department University of Rochester Rochester, NY 14627 USA
[email protected] Claude S a m s o n INRIA Sophia-Antipolis 2004, Route des Lucioles, B.P. 93 06902, Sophia Antipolis Cedex France
[email protected] Shankar Sastry Electronics Research Laboratory University of California at Berkeley Berkeley, CA 94720 USA
[email protected] Michael Seelinger Department of Aerospace and Mechanical Engineering Fitzpatrick Hall of Engineering University of Notre Dame Notre Dame, Indiana 46556-5637 Michael.J.Seelinger.
[email protected] R a j e e v Sharma Department of Computer Science & Engineering, The Pennsylvania State University
XI
317 Pond Laboratory, University Park, PA 16802-6106
[email protected] Dr. R o n a l d W . S i m o n Sandia National Laboratories PO Box 5800, MS 1003 Albuquerque, NM 87185 USA S t e v e n B. Skaar Department of Aerospace and Mechanical Engineering Fitzpatrick Hall of Engineering University of Notre Dame Notre Dame, Indiana 46556-5637
[email protected] Stefano Soatto Dept. of Electrical Engineering Washington University St. Louis - MO 63130 USA
[email protected] Ilan S h i m s h o n i Department of Computer Science The Technion Haifa 32000 Israel
[email protected] Allen T a n n e n b a u m Department of Electrical and Computer Engineering University of Minnesota Minneapolis, MN 55455
[email protected] D i m i t r i s P. Tsakiris INRIA Sophia-Antipolis 2004, Route des Lucioles, B.P. 93 06902, Sophia Antipolis Cedex France
[email protected]
XII
List of Contributors
Mustafa Unel Division of Engineering Brown University Providence, RI 02912 USA muQlems.brown.edu
William A. Wolovich Division of Engineering Brown University Providence, RI 02912 USA wawQlems.brown.edu A n t h o n y Yezzi, Jr. Department of Electrical and Computer Engineering University of Minnesota Minneapolis, MN 55455 USA
[email protected] Billibon H. Yoshimi Department of Computer Science Columbia University New York, NY 10027 USA
[email protected]
R e s e a r c h Issues in V i s i o n and C o n t r o l Gregory D. Hager, David J. Kriegman, and A. Stephen Morse Center for Computational Vision and Control Departments of Computer Science and Electrical Engineering Yale University New Haven, CT 06520-8285
S u m m a r y . The past decade has seen an increasing interest in the areas of real-time vision, visual tracking, active focus of attention in vision, and vision-based control of motion. Although these topics encompass a diverse set of problems spanning the fields of vision, control, robotics, and artificial intelligence, they all share a common focus: the application or processing of visual information in a way which entails the design and analysis of algorithms incorporating concepts studied in the field of control, namely feedback, estimation, and dynamics. This chapter, which originally appeared as a white paper distributed to participants before the Block Island workshop on Vision and Control, paints a general picture of several problems and issues at the confluence of vision and control. In the final chapter, we return to some of these topics in light of the discussions which took place during the scheduled break-out groups at the workshop. The chapter is divided into three sections. The first two are a discourse on two questions-Why Vision? and Why Control? Within these sections, we briefly discuss focussed topical areas and list some motivational questions. The final section contains some cross-cutting themes and questions.
1. Why Vision? To begin an inquiry into vision and control, it is appropriate to ask "why focus on vision as a sensor modality?" Clearly, the existence of the h u m a n visual system has always been a strong motivation. H u m a n vision has fascinated philosophers and scientists - - from Aristotle to Kepler to Helmholtz to Gibson - - for millennia. Perhaps more compelling is the fact that nearly all animate biological systems have some type of light-based perception mechanism. The prevalence of biological vision, a sense which often consumes a significant fraction of an organism's brain, suggests that vision should be strongly considered as a staple for artificial animate beings. More pragmatically, the fact that cameras are simple, solid-state, passive sensing devices which are reliable and extremely cheap per bit of d a t a delivered also provides a strong argument for their use. More specifically, if we compare cameras to other commonly available sensory modalities, we see that vision offers several advantages including: 1. F i e l d o f V i e w : Vision is a passive, non-contact sensor which can capture multi-spectral information covering a large region of space - - in fact, panoramic cameras can capture up to a hemisphere of data in a single frame. No other sensing modality can make similar claims.
2
Gregory D. Hager, David J. Kriegman, and A. Stephen Morse
2. B a n d w i d t h : Vision delivers information at a high rate - - 60 or more frames a second - - and with high resolution - - up to a million pixels per frame or more. Although this is still several orders of magnitude less bandwidth than the human visual system, computer vision still provides higher spatial and temporal density coverage then most comparable sensing systems. 3. Accuracy: Under the appropriate conditions and with the right algorithms, cameras are able to provide geometric localizations of features or objects with an accuracy that is one to two orders of magnitude higher than the sampling density of the array. 4. C o n t r o l l a b i l i t y : Cameras outfitted with motorized lenses and mounted on pan-tilt platforms, arms or mobile bases can actively explore and adapt their sensing capabilities to the task and environment. Despite the obvious advantages of vision as a sensing modality it has, even in recent times, received scant attention within the control community. One might argue that this is largely a technological issue; however in point of fact hardware for real-time vision processing has been available for over a decade. What then limits progress? One potential answer ~s that the technology is "not yet ripe" for costeffective vision-based systems. If so, then one might consider the following questions: - Is today's imaging equipment a roadblock to progress? What is the "ideal" imaging sensor - - what limits speed and resolution? Most of today's com-
-
monly available vision hardware is based on television standards established decades ago for entirely different purposes. Although other types of equipment exist (e.g. progressive scanning cameras which offer highsampling rates), they have not penetrated into the research community in significant quantity. Perhaps establishing a "benchmark" imaging system designed for vision-based control would serve to speed progress by avoiding the repetition of the same "learning curve" by every research lab. What are the fundamental limiting factors to vision/vision-based control? For example, given the quantum efficiency of the sensing technology used in today's cameras, is it possible to determine the theoretical sampling limits, the resultant time delay, and hence the limits of performance of a vision-based control system? A more common justification for the current rate of progress in vision as a whole is the inherent difficulty of the vision problem. Despite the fact that vision is ubiquitous in the biological world, this "proof of concept" has not provided a significant amount of insight into how to build similar artificial systems. However, it is the case that for sufficiently well-defined and appropriately structured problems, vision has been shown to be both feasible and practical. Considering this dichotomy, one might consider the following questions:
Research Issues in Vision and Control -
-
3
Why aren't cameras used in more real applications involving dynamics and motion? As noted above, vision is one of the few passive sensing modalities which is able to sense, localize and measure motion within a large workspace. Furthermore, algorithms for measuring motion have been wellunderstood for several years given sufficiently structured circumstances. Is the lack of use of vision an unawareness of these results, or are there more fundamental issues of practicality a n d / o r robustness? Can we study vision alone (particularly within the context of animate systems) ? Is it possible that we are/ocusing on the wrong issues by doing so ? T h a t is, does vision need to be augmented or combined with other sensing modalities? Or, perhaps the issue is not vision itself, but the fact t h a t the problems considered by the vision community are not necessarily those which are most relevant within a control context.
In short, "why vision" is a question whose answer seems to lie at the intersection of theory and practice. Can we take a sensor which is, in theory and through biological demonstration, sufficient for many purposes and effectively apply it to practical problems? More to the point, can we identify problems for which there is no other plausible or practical solution?
2. W h y Control? A fundamental premise of this book is that the principles of ]eedback, estimation, and dynamics are essential to an understanding of the vision problem - particularly when vision is applied to complex tasks involving physical motion of a mechanism. The field of control, which includes studies of feedback, is hence arguably central to the endeavor of vision. Given the scope of topics dealt with in the fields of control and estimation, it is no surprise that there are already many areas of vision-related research that use the concepts of feedback and control in the description, design, or analysis of algorithms. Below we present five short essays, each of which includes a brief overview of a specific topical area and suggests possible questions for consideration. 2.1 V i s i o n W i t h i n C o n t r o l Despite the fact that feedback control systems employing visual sensors have been actively studied for a very long time, only quite recently have control problems special to vision-based motion control begun to attract the attention of the "mainstream" control community. There are no doubt many reasons for this, not the least of which is that this community may not have thought that the control issues concerned with vision sensors are very different t h a n the control issues concerned with other types of sensors. But there are issues
4
Gregory D. Hager, David J. Kriegman, and A. Stephen Morse
which arise in vision-based motion control which are not typical of more familiar feedback configurations. For example, an especially interesting feature of visual-based motion control is t h a t both the process output (e.g., the position of a robot in its workspace) and the reference set-point (e.g., l a n d m a r k or visuallydetermined target) can be and often are observed through the same sensors (i.e., cameras). Because of this unusual architectural feature, it is sometimes possible to achieve precise positioning (at least in the absence of measurement noise), despite not only process model imprecision (as in the case of a conventional set-point control system with a perfect loop-integrator and precise output and exogenous reference sensing) but sensor imprecision as wellJ J u s t when this can be accomplished and how one might go a b o u t setting up the a p p r o p r i a t e error signals is not always apparent, especially when the target is determined by observed features or when the task to be accomplished is more challenging t h a n simple positioning. The main reason for introducing feedback into any control system is, of course, to cause the controlled process to function in a prescribed m a n n e r despite process modeling errors, sensor and a c t u a t o r imprecision and noise. Were it not for these imprecisions and uncertainties, there would hardly be a reason for introducing feedback at all. Within the context of vision-based motion control, it is thus natural to ask if one can get by, at least in some applications, with poorly calibrated cameras, with imperfect lenses, with only partially determined correspondence between features in multiple images. Thinking along these lines suggests other more general questions: 1. What is it that one really needs to extract from an image in order to accomplish a particular task? 2 For example, how much scene reconstruction is needed for precise navigation? In how much detail need object recognition be carried out in order to instruct a robot to remove a p a r t from an automobile engine? How accurately need one reverse engineer a 3Dobject in order for the resulting c o m p u t e r model to be detailed enough to recreate the object with a prescribed precision? 2. The g e o m e t r y of imaging is often well-modeled using projectivities. Is there/Can there be a well-founded theory of estimation and motion control ]or sensors modeled as projeetivities, particularly a theory which is as power]ul as linear system theory is today?
1 Perhaps this is one of the reasons why our eyes are in our heads rather than on the tips of our fingers? Perhaps this, rather than ergonomics, is the real reason why it sometimes proves to be best to not co-locate the sensors (e.g., cameras) and controlled instruments (e.g., scalpel) in a vision-guided surgical procedure? 2 A question similar to this has been posed, and to some extent answered, within the control community: "What do you need to know about a process in order to control it?"
Research Issues in Vision and Control
5
2.2 C o n t r o l w i t h i n V i s i o n
Recently, concepts used in system theory (e.g. partial differential equations or nonlinear estimation) have been applied to the problems of image processing. Here the problem is not to accommodate the physical dynamics of a system in motion, but rather to phrase the extraction of geometric information from one or more images in terms of a dynamical system expressed on image brightness values. Additionally, a number of vision or image understanding architectures are stratified into levels of abstraction, where "low-level" vision modules generally take image data or filtered images as input, while "mid-level" modules operate on geometric features or tokens (e.g. lines, image curves, surfaces, and so forth), and "high level" modules operate on 3-D objects. It is often posited that information flow from higher levels to lower levels can constrain and therefore facilitate the extraction of relevant information by the lower level modules. For example, knowledge of the types of objects expected in a scene might guide segmentation processes which might guide the choice of filters. Furthermore, there is neurophysiological evidence of such feedback pathways between visual areas in primate brains. While there have been few attempts to incorporate feedback between different levels, there have been yet fewer attempts to analyze and develop synthesis tools for such systems. Questions in this area seem to revolve primarily around estimation problems, for example: 1. Can feedback make a silk purse out of a sow's ear? That is, given existing vision modules which may in some sense be deficient when used in a "feedback free" manner, are there ways to utilize feedback between modules to achieve tasks that couldn't be performed without feedback? 2. To what extent can dynamics and feedback be used to minimize the combinatorics inherent in correspondence problems of 3-D object recognition, stereo vision and structure from motion? 3. Are there useful ideas from estimation and control of nonlinear systems that can be applied to the solution of vision problems ? Can these problems be phrased independently from the overall task of the system? 4. What are some of the biological connections/implications of vision in a control loop? Are there well-understood feedback principles in biological vision/control systems which could be usefully mimicked in artificial systems? 5. To what extent do biological vision/control systems use feedback, and to what extent do they use feed-forward? 2.3 V i s i o n - B a s e d
Control of Motion
Within the robotics community, research in the area of vision-based control of motion - - particularly the area of visual servoing - - has addressed the theoretical and practical issues of constructing mechanisms which interact
6
Gregory D. Hager, David J. Kriegman, and A. Stephen Morse
with the environment under direct visual control. Since the seminal work of Shirai and Inoue (who describe how a visual feedback loop can be used to correct the position of a robot to increase task accuracy) in the early 70's, considerable effort has been devoted to the visual control of devices such as camera "heads," robotic manipulators, and mobile vehicles. The last few years have seen a marked increase in published research. This has been largely fueled by personal computing power crossing the threshold which allows analysis of scenes at a sufficient rate to 'servo' a robot manipulator. Prior to this, researchers required specialized and expensive pipelined pixel processing hardware to perform the required visual processing. As suggested above, the motivation for this research is to increase the accuracy, flexibility and reliability of robotic mechanisms. For example, although there are nearly a million robots in the world today, they are largely excluded from application areas where the work environment and object placement cannot be accurately controlled to reduce geometric uncertainty. Flexible manufacturing cells of the future may need a sensor such as vision to successfully define a niche apart from standard manufacturing. On another front, a future application for manipulation may be found within microscopic systems - - here vision ma:~ be needed to increase the accuracy of the system to perform precise operations at the sub-micron level. Automated driving of road vehicles has already been demonstrated - - vision has the advantage of providing automated road following and related driving skills without extensive changes (e.g. insertion of fiducial devices) to existing road systems. Other applications that have been proposed or prototyped span manufacturing (grasping objects on conveyor belts and part mating), teleoperation, missile tracking, and fruit picking as well as robotic ping-pong, juggling, balancing and even aircraft landing. Over the last decade, much of the focus in the community has been on "systematic" issues related to the architecture of the vision and control systems. Perhaps the time has come to try to phrase this debate in more concrete terms. For example: 1. Currently, most control algorithms "abstract" away the true nature of images by assuming that some low-level image processing module extracts (estimates the position of) certain geometric features. Is it necessary/good/practical to separate extraction of (image-level) features from the control or estimation problems/algorithms using that information. For example, is control/estimation directly from image information feasible or possible? 2. There is an ongoing debate about the use of reconstruction for control (similar to the separation principle or, in an adaptive context, the idea of certainty equivalence) versus feedback based directly on image level quantities. How can one define or compare the theoretical/realistic boundaries of what can be done by specific control architectures? Are there other alternatives which have not yet been explored?
Research Issues in Vision and Control
7
3. Much of the work to date in visual servoing has been devoted to the static positioning problem. Are there issues which set apart "static" hand-eye coordination problems (e.g. where the target is stationary) from those which are fundamentally "dynamic" in nature (that is, where the target is moving) ?
2.4 Visual Tracking As suggested in the previous section, a central problem in vision-based control is that of eliciting geometric information from images at real-time rates. When this geometric information pertains to the motion of a single object or target, the problem is termed visual tracking. In addition to its use in vision-based control, visual tracking is often used in human-computer interfaces, surveillance, agricultural automation, medical imaging, and is a central component of many visual reconstruction algorithms. The central challenge in visual tracking is to determine the configuration of a target as it moves through a camera's field of view. This is done by solving what is known as the temporal correspondence problem: the problem of matching a target region in successive frames of a sequence of images taken at closely-spaced time intervals. What makes tracking difficult is the extreme variability possible in the images of an object over time. This variability arises from four principle sources: variation in object pose, variation in illumination, deformations or articulations of the object, and partial or full occlusion of the target. Given the constraint of real-time performance, the challenge in visual tracking is to match the amount of data to be processed to the available computational resources. This can be done in any number of ways, including: simplifying the problem, utilizing specialized image processing hardware, by clever algorithm design, or all of the above. Most of today's systems utilize "focus of attention" by working with features which can be extracted from a small amount of (spatially) local image information. Implicit in this approach is the use of a sufficiently rich and accurate predictive model. Various solution methods have been proposed including 2-D spatial temporal evolution models incorporating rigid or affine image motion, a-priori known rigid threedimensional geometric models, incrementally constructed rigid models, and more complex learned motion models. Some of these models include dynamics - - many do not. Finally, much work in tracking has separated the direct processing of images into "features" (edges, regions of homogeneous color and/or texture, and so forth) and the estimation or observer process associated with "stitching together" those features over time. A slightly different architecture results when the image of the object is viewed directly as the input to the estimation algorithm. Algorithms for region tracking and various types of "snake" tracking of this form have been developed.
8
Gregory D. Hager, David J. Kriegman, and A. Stephen Morse
The area of visual tracking suggests a broad range of technical issues and questions including: 1. Generally, cameras observe rigid or articulated 3-D objects, although the observations themselves are 2-D projections. Can observers be developed
that operate on the 2-D projections without explicit knowledge or estimation of the 3-D dynamics? Note this is essentially dual to the problem of controlling a manipulator in 3-D from 2-D images. 2. As suggested above, the dynamics of a tracking process are often expressed in terms of the geometry of tracked targets, not in terms of the temporal evolution of images. Are there cogent arguments for or against taking a strongly geometric or image-based view of visual tracking processes? More generally, can this question be divorced from the application using tracking (e.g. is tracking just estimation, or should it be considered within the context of control)? 3. A central part of tracking is accommodating changes in the appearance of the target over time. Is it possible to characterize the set of local (differential) and global appearance changes for an object, and to develop thereby effective methods/or estimating motion. Does this extend to "higher-order" models (e.g. Lagrangian motion) in a natural way? 4. Most modern visual tracking algorithms rely strongly on a prediction of the target motion. Are there control approaches that would increase tracking performance for objects with unknown dynamics and or structure? Imagine, for example, trying to track the face of a person as they look about, talk, move their heads, and so forth.
2.5 Active Vision/Hybrid Systems Not all vision processes are subject to the strong time constraints imposed in tracking and control, but most vision processes must still operate in a reliable and timely fashion. Two major factors which impose themselves in any visual processing are the problem of efficient data reduction and the problem of tolerating or accounting for missing information. The former results from the complexity of the image formation process, and the latter results from the properties of the 3-D to 2-D camera projection process. One of the more recent and potentially fertile ties between the vision and control communities is due to work termed "active" or "purposive" vision. Here, the goal is to design vision-based systems which are actively controlled to improve task performance. These concepts can be applied both at "lowlevel" control or "high-level" strategic planning levels. Because the image acquisition process and modal control of algorithms or mechanisms are themselves discrete processes, the resulting systems are often hybrid systems - - they contain both continuous-time and discrete-time components. For example, one way of formulating a navigation problem is as a "high-level" discrete switching or decision-making process for moving from
Research Issues in Vision and Control
9
viewpoint to viewpoint operating in combination with a low-level continuous motion process operating within the neighborhood of each viewpoint. Given the strong systems component of active vision, it would seem that fruitful questions in this area would involve the modeling and construction of complex systems, for example: 1. Within the context of active vision, the use of hybrid systems theory has proven to be a useful "descriptive" tool, but not one that is strongly "prescriptive." Is this due to the fundamental nature and complexity of
the problems, the lack of analysis tools, both, neither? 2. Are there examples of hybrid systems which have been studied in the area of control and which would form useful analogs for system construction in vision? 3. Typically, any realization of a system using vision and control in an active or hybrid sense is large and complex from a software or systems engineering point of view. To what extent are complex systems really a problem of "software engineering" versus analytical design and analysis?
3. T h e
Confluence
of Vision
and
Control
We hope to move beyond a parochial classification of research topics as "vision" or "control" and toward a broader notion of vision and control as a defined area of research. Although the previous sections pose questions about issues and approaches within the separate areas of vision and control, there is the danger that this "misses the point." How can we characterize, catalyze, or synergize research at the confluence of vision and control? Here are some questions that one might consider: 1. In AI, there is the notion that problems may be "AI-complete," meaning that solving the problem would in general entail the development of a complete "intelligence." Is this also true of vision, even within the
context discussed in this workshop ? For example, does "initialization" of vision-based systems require the solution to the "general" vision problem? More precisely, is there a set of (scientifically) interesting, useful and self-contained subproblems in vision which can be solved through the use of ideas studied in control? 2. From a control perspective, is there something that is fundamentally unique about cameras as sensors - - for example, will the fact that cameras are well-modeled by projectivities lead to substantial developments in or applications of nonlinear control? 3. Are there novel issues/approaches/solutions which set "dynamical vision" apart from "static" vision. Or, conversely, can concepts developed to deal with vision/vision-based systems which are dynamical in nature be reflected back to the static image interpretation problem?
10
Gregory D. Hager, David J. Kriegman, and A. Stephen Morse
4. What are the most likely practical implications o] vision and control? That is, are there domains or "markets" that will exploit the results o/ this scientific and technological research? Conversely, are there immediately solvable problems which would have high impact, but which are being overlooked? The remainder of this book considers m a n y of these questions within the context of specific applications a n d / o r approaches to problems combining vision and control. In the final chapter we revisit these themes as they were considered at the Block Island Workshop on Vision and Control.
Acknowledgement. G. Hager was supported by NSF IRI-9420982, NSF IRI-9714967 and ARO-DAAG55-98-1-0168. D. Kriegman was supported under an NSF Young Investigator Award IRI-9257990 and ARO-DAAG55-98-1-0168. A.S. Morse was supported by the NSF, AFOSR and ARO.
Visual Homing: Surfing on the Epipoles Ronen Basri I , Ehud Rivlin 2, and Ilan Shimshoni 2 1 Department of Applied Math The Weizmann Institute of Science Rehovot 76100 Israel 2 Department of Computer Science The Technion Haifa 32000 Israel
S u m m a r y . We introduce a novel method for visual homing. Using this method a robot can be sent to desired positions and orientations in 3-D space specified by single images taken from these positions. Our method determines the path of the robot on-line. The starting position of the robot is not constrained, and a 3-D model of the environment is not required. The method is based on recovering the epipolar geometry relating the current image taken by the robot and the target image. Using the epipolar geometry, most of the parameters which specify the differences in position and orientation of the camera between the two images are recovered. However, since not all of the parameters can be recovered from two images, we have developed specific methods to bypass these missing parameters and resolve the ambiguities that exist. We present two homing algorithms for two standard projection models, weak and full perspective. We have performed simulations and real experiments which demonstrate the robustness of the method and that the algorithms always converge to the target pose.
1. I n t r o d u c t i o n Robot navigation and manipulation often involves the execution of commands which intend to move a robot (or a robot arm) to desired positions and orientations in space. A common way to specify such a c o m m a n d is by explicitly providing the robot with the three-dimensional coordinates of the desired position and the three parameters defining the desired orientation. This method suffers from several shortcomings. First, it requires accurate advance measurement of the desired pose. This is particularly problematic in flexible environments, such as when a robot is required to position itself near an object which may appear at different positions at different times. Secondly, due to occasional errors in measuring the actual motion of the robot, the robot may be unable to put itself sufficiently accurately in the desired position. In this paper we propose a different approach to the problem of guiding a robot to desired positions and orientations. In our m e t h o d the target pose is specified by an image taken from that pose (the target image). The task given to the robot is to move to a position where an image taken by a camera mounted on the robot will be identical to the target image. During the execution of this task the robot is allowed to take pictures of the environment,
12
Ronen Basri, Ehud Rivlin, and Ilan Shimshoni
compare them with the target image and use the result of this comparison to determine its subsequent steps. We refer to the use of images to guide a robot to desired positions and orientations by visual homing. We introduce a new method for visual homing. Our method differs from previous methods [3, 4, 6, 12, 13, 18, 19] in many respects. The method requires the pre-storage of the target image only. It then proceeds by comparing the target image to images taken by the robot, one at a time. No 3-D model of the environment is required, and the method requires no memory of previous images taken by the robot. Thus, the method uses minimal information and can deal also with a moving target. We present two homing algorithms for two standard projection models, weak and full perspective. The algorithms are based on recovering the epipolar geometry relating the current image taken by the robot and the target image. Correspondences between points in the current and target images are used for this purpose. (The problem of finding correspondences between feature points, however, is not addressed in this paper.) Using the epipolar geometry, most of the parameters which specify the differences in position and orientation of the camera between the two images are recovered. However, since not all the parameters can be recovered from two images, we develop specific methods to bypass these missing parameters and resolve ambiguities when such exist. The path produced by our algorithm is smooth and optimal to the extent that is possible when only two images are compared. Furthermore, both simulations and real experiments demonstrate the robustness of the method and that the path produced by the algorithm always converges at the target pose.
2. H o m i n g
Under
Weak-Perspective
2.1 D e r i v a t i o n Our objective is to move the robot to an unknown target position and orientation S, which is given in the form of an image I of the scene taken from that position. At any given step of the algorithm the robot is allowed to take an image 11 of the scene and use it to determine its next move. Denote the current unknown position of the robot by S I, our goal then is to lead the robot to S. WLOG we can assume that the pose S is the identity pose. Let Pi -(Xi, Yi, Zi) T, 1 < i < n, be a set of n object points. Under weak-perspective projection, the image at the target pose is given by x~ = X~,
y~ = Y~.
A point p~ = (x~, y~)T in the current image I' is given by
p~ : [sRPi](1,2) + t,
Visual Homing: Surfing on the Epipoles
13
where R is a 3 • 3 rotation matrix, s is some positive scale factor, t E ~t 2 is the translation in the image, and [-](1,2) denotes the projection to the first and second coordinates of a vector. [7, 8, 10] showed that using at least four corresponding points in two images the epipolar constraints relating them can be recovered. From these constraints the scale can be derived. It can be verified that the translation component orthogonal to the epipolar lines can be recovered. The translation component parallel to the epipolar lines cannot be determined from this equation but is estimated using one pair of corresponding points. The estimate improves as the error in the viewing direction diminishes. For the rotation components it can be easily shown that every rotation in space can be decomposed into a product of two rotations, a rotation around some axis that lies in the image plane followed by a rotation of the image around the optical axis. The image rotation can be compensated for by rotating the epipolar lines in the current image until they become parallel to the epipolar lines in the target image. Differences in the viewing direction, however, cannot be resolved from two images. This is the reason why structure from motion algorithms that assume an orthographic projection require at least three images to recover all the motion parameters [7, 16]. Although two images are insufficient to resolve the differences in viewing direction completely, the axis of rotation required to bring the robot to the target pose can still be recovered from the images leaving the angle of rotation the only unrecoverable parameter. Knowing the axis of rotation will allow us to gradually rotate the robot until its viewing direction will coincide with the target viewing direction. In addition, the direction of rotation is subject to a twofold ambiguity; namely, we cannot determine whether rotating to the right or to the left will lead us faster to the target orientation. In [1] we show that the axis of rotation is orthogonal to the epipolar lines. Thus, the possible viewing directions lie on a great circle on the viewing sphere which passes through the viewing directions of the target and current images. Therefore by rotating the camera parallel to the direction of the epipolar lines we can compensate for the differences in the viewing direction.
2.2 R e s o l v i n g the ambiguity We have been able to determine the great circle on the viewing sphere along which the robot should rotate. However, we have not determined which direction on the circle is the shorter of the two directions connecting the current and target viewing directions. To resolve this ambiguity we introduce a similarity measure that can be applied to the current and target images. While the robot is changing its viewing direction along the great circle we will evaluate the similarity between the images and see whether they become more or less similar. Using this information we will be able to determine if the robot is changing its viewing direction along the shortest path to the target viewing direction, or
14
Ronen Basri, Ehud Rivlin, and Ilan Shimshoni
if it is rotating in the other direction, in which case we can correct its rotation. This similarity measure should vary with a change in the viewing direction, but be invariant to scale changes, translation, and image rotation. The measure of similarity we have chosen is based on the apparent angles formed by triplets of points.in the current and target images. Figure 2.1 shows several examples of how apparent angles change as the viewing direction moves on a great circle. Given an angle ~5 in the scene and a great circle on the viewing sphere we denote the apparent angle as a function of the angle on the great circle O by r r has the following characteristics: it is a periodic function whose period is 2~r. Furthermore, r = -r + ~r). Also, r has a single maximum at some angle, Oread, and a single minimum, obtained at Omi~ = Oma~ + ~r. Finally, each angle between the maximum and minimum appears exactly twice in the function.
150
" .......... ".,..
I00
so .I >
-~].........~ 9
-50
-i00 -150
~ !
/
/ i
\ .............../ 0
50
lOO 150 200 250 A n g l e on G r e a t C i r c l e
300
350
Fig. 2.1. Three examples showing the effect of changing the viewing direction along a great circle on projected angles. Our measure of similarity is based on testing whether the apparent angles seen in the images taken by the robot are approaching the corresponding angles in the target i m a g e . In identifying the correct direction several issues have to be addressed. First, there exists a second viewing direction on the great circle which gives rise to the same apparent angle as in the target image. We call this direction a false target. Fortunately, it is not difficult to distinguish between the target and the false target because every angle in the scene gives rise to a different false target. Secondly, there exist "good" and "bad" sections on the great circle, where a section is considered "good" if when rotating in the correct direction along this section the apparent angle approaches its value in the target image. Figure 2.2(a) shows an example of r The thick segments denote the "good" sections of the great circle, and the thin segments denote the "bad" sections of the great circle. It can be seen that a "good" section exists around the target viewing direction, and as we get further away from the target "bad" sections appear. Consequently, suppose we consider the apparent angles in
Visual Homing: Surfing on the Epipoles
15
the current image, count how many of them approach the target, and use majority to decide on the correct direction then we are likely to make a correct choice near the target viewing direction. Our chances to be correct, however, deteriorate as we get away from the target. We therefore define a similar measure for the mirror image. Again, the great circle can be divided to "good" and "bad" sections, where now "good" sections are sections in which walking in the wrong direction will make the apparent angle approach the mirror image (Fig. 2.2(b)). This measure is likely to point to the correct direction in the neighborhood of the mirror image. 40
40 ximum
Im~
20
20 " ror
|
0
other
rror
9
.......................................................
|
' rot
other
0
rror
...........................
,a
other
target
arget
other
-20
-20
-30
-30
-40
arget
-40 SO
[. a ) x
arget
100 150 200 Angl .........
250 ,rcle
300
350
(b)
50
i00 150 200 Angl ........
250 ircle
300
350
Fig. 2.2. "Good" (thick lines) and "bad" (thin lines) sections with respect to the desired angle at the target (left) and mirror (right) images obtained while moving along a great circle on the viewing sphere. Since each of the two measures, the similarity to the target and mirror images, are reliable in different parts of the great circle we would like to use each of them at the appropriate range of viewing directions. We do so by checking which of the two measures achieves a larger consensus. T h e rationale behind this procedure is that for every angle in the scene each of the measures covers more than half of the great circle. Fig. 2.3 shows the result of a simulation which shows the percent of angles (and standard deviations) which point in the correct direction for a given viewing direction on the great circle using the target and the mirror angles. We have shown how we can estimate the motion parameters which separate the current pose of the robot from the target pose. T h e rotation of the image has been recovered completely. For the translation components in the image plane we have an estimate. However the rest of the parameters, the translation in depth (indicated by a scale change) and the change in the viewing direction were estimated only as a direction, while their magnitude, the distance in depth between the two images and the angular separation between the viewing directions were not determined. In the rest of this section we show how we can estimate the missing distances to the target pose. Estimating these missing distances will enable the robot to perform a smooth motion to the target by combining at every step a similar fraction of each of the motion components.
16
Ronen Basri, Ehud Rivlin, and Ilan Shimshoni -,'4 1.1
mirror - mirror+s .d. -.....
1 k 0.9
O. 7
~
0.6
~
0.5
~
(1.4
,~,
"%
0.8
-~
9
,%,
//
",~/"...,
0
50
i00
targ
----
,,~. target-~/~. ......
%
,:
' . ~ ' \ / ?'
150 200 250 300 Angle from target
350
400
Fig. 2.3. The plot shows the percent of angles (and standard deviations) which point in the correct direction for a given viewing direction on the great circle using the target and the mirror angles. We begin by deriving the component of translation along the optical axis from the scale changes. Suppose the scale between the current and the t a r g e t image is given by s, and suppose t h a t at the following step the scale becomes s'. It can be easily shown t h a t the n u m b e r of steps of this size is n =
s'/(s
-
s')
We estimate the angular separation between the current and t a r g e t viewing directions using a M a x i m u m Likelihood estimator of this angle which uses the percentage of angles which point to the correct direction (Figure 2.3). Details can be found in [1].
3. F u l l P e r s p e c t i v e
Homing
In this section we consider the problem of homing under perspective projection. Below we describe our m e t h o d for homing when the focal length of the camera is known. For this case we show how the motion p a r a m e t e r s can be recovered, and develop methods to resolve the ambiguity in the direction and recover the distance to the target position. In [1] we extend this formulation to the case t h a t the focal length is unknown. 3.1 H o m i n g w i t h a k n o w n focal l e n g t h Again, we wish to move a robot to an unknown target position and orientation S, which is given in the form of an image I of the scene taken from t h a t position. At any given point in time the r o b o t is allowed to take an image I ' of the scene and use it to determine its next move. Denote the current unknown position of the robot by S', our goal then is to lead the robot to S. Below we assume t h a t the same c a m e r a is used for b o t h the t a r g e t image and images
Visual Homing: Surfing on the Epipoles
17
taken by the robot during its motion, and that the internal parameters of the camera are all known. The external parameters, that is, the relative position and orientation of the camera in these pictures is unknown in advance. To determine the motion of the robot we would like to recover the relative position and orientation of the robot S ~ relative to the target pose S. Given a target image I taken from S and given a second image I ~ taken from S ~, by finding sufficiently many correspondences in the two images we estimate the motion parameters using the algorithm described in [5, 17], which is based on the linear algorithm proposed in [11, 15]. This algorithm requires at least eight correspondences in the two images. Other, non-linear approaches can be used if fewer correspondences are available [9]. The algorithm proceeds by first recovering the essential matrix E relating corresponding points in the two images. Once the essential matrix is recovered, it can be decomposed into a product of two matrices E = R T , the rotation matrix R and a matrix T which contains the translation components. The rotation matrix, which determines the orientation differences between the two images, can be fully recovered. The translation components, in contrast, can be recovered only up to an unknown scale factor. These recovered translation components determine the position of the epipole in the current image, which indicates the direction to the target position. In the next section we show how to determine whether the target position is in front or behind the current position of the robot. However we cannot determine the distance to the target position. After we recover the motion parameters we direct the robot to move a small step in the direction of the target. In addition, given the rotation matrix R we calculate the axis and angle of rotation that separates the current orientation of the robot from the target orientation and rotate the robot arm about this axis by a small angle. After performing this step the robot takes a second image. Using this image we recover the distance to the target position and use this distance to perform a smooth motion. 3.2 Resolving the ambiguity in the direction to t h e t a r g e t We have seen so far how given the current and target image the translation required to take the robot to the target position is indicated by the position of the epipole in the current image. However, using the epipole the direction to the target can be recovered only up to a twofold ambiguity, namely, we know the line which includes the two camera positions, but we do not know whether we should proceed forward or backward along this line to reach the target position. Below we show how by further manipulating the two images we can resolve this ambiguity. Using the current and target images we have completely recovered the rotation matrix relating the two images. Since a rotation of the camera is not affected by depth we may apply this rotation to the current image to obtain an image that is related to the target image by a pure translation. After
applying this rotation the two image planes are parallel to each other and the epipoles in the two images fall exactly in the same position. Denote this position by (v_x, v_y, f)^T. We may now further rotate the two image planes so as to bring both epipoles to the position (0, 0, f)^T. Denote this rotation by R_0. Notice that there are many different rotations that can bring the epipoles to (0, 0, f)^T, all of which are related by a rotation about (0, 0, f)^T. For our purpose it will not matter which of these rotations is selected. After applying R_0 to the two images we now have the two image planes parallel to each other and orthogonal to the translation vector. The translation between the two images, therefore, is entirely along the optical axis. Denote the rotated target image by I and the rotated current image by I'. Relative to the rotated target image denote an object point by P = (X, Y, Z). Its coordinates in I are given by

x = fX/Z,    y = fY/Z,

and its corresponding point (x', y', f)^T in I' is given by

x' = fX/(Z + t),    y' = fY/(Z + t),

where t represents the magnitude of translation along the optical axis, and its sign is positive if the current position is in front of the target position, and negative if the current position is behind the target position. We can therefore resolve the ambiguity in the direction by recovering the sign of t. To do so we divide the coordinates of the points in the target image by their corresponding points in the current image, namely

x/x' = y/y' = (Z + t)/Z = 1 + t/Z.

This implies that t = Z(x/x' - 1). Unfortunately, the magnitude of Z is unknown. Thus, we cannot fully recover t from two images. However, its sign can be determined since

sign(t) = sign(Z) sign(x/x' - 1).
Notice that since we have applied a rotation to the target image, Z is no longer guaranteed to be positive. However, we can determine its sign since we know the rotation R_0, and so we can determine for every image point whether it moved to behind the camera as a result of this rotation. Finally, the sign of x/x' - 1 can be inferred directly from the data, and thus the sign of t can be recovered. Since it is sufficient to look at a single pair of corresponding points to resolve the ambiguity in the translation, we may compute the sign of t for every pair of corresponding points and take a majority vote to obtain a more robust estimate of the actual direction.
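The following is a minimal sketch of this majority vote, assuming the rotation R_0 has already been applied so that corresponding x coordinates in the rotated target and current images, together with the sign of each point's depth Z after rotation, are available; the function and variable names are ours, not the authors'.

```python
import numpy as np

def direction_sign(x_target, x_current, z_sign):
    """Majority vote over sign(t) = sign(Z) * sign(x/x' - 1); returns +1 if the
    target lies in front of the current position, -1 if it lies behind."""
    x_target = np.asarray(x_target, dtype=float)
    x_current = np.asarray(x_current, dtype=float)
    votes = np.sign(z_sign) * np.sign(x_target / x_current - 1.0)
    votes = votes[votes != 0]          # ignore degenerate points with x == x'
    return 1 if votes.sum() >= 0 else -1
```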
3.3 Recovering the distance to the target

To estimate the distance to the target position we let the robot move one step and take a second image. We then use the changes in the position of feature points due to this motion to recover the distance. Using the current and target images we have completely recovered the rotation matrix relating the two images. Since a rotation of the camera is not affected by depth we may apply this rotation to the current image to obtain an image that is related to the target image by a pure translation. Below we refer by I' and I'' to the current and previous images taken by the robot after rotation is compensated for, so that the image planes in I, I', and I'' are all parallel. We begin by observing that any two images related purely by a translation give rise to the same epipolar lines. Given an image I and a second image I' which is obtained by a translation by t = (t_x, t_y, t_z)^T, notice first that the two images have their epipoles in the same position. This is because the homogeneous coordinates of the epipole in I' are identical to t, while the homogeneous coordinates of the epipole in I are identical to -t. Consider now a point (x, y, f)^T in I, and its corresponding point (x', y', f)^T in I',

x' = f(X + t_x)/(Z + t_z),    y' = f(Y + t_y)/(Z + t_z).

Denote the epipole by (v_x, v_y) = (f t_x/t_z, f t_y/t_z). It can be readily shown that both (x, y) and (x', y') lie on the same line through (v_x, v_y), since

(x' - v_x)/(x - v_x) = (y' - v_y)/(y - v_y).

We turn now to recovering the distance to the target position. Given a point p = (x, y, f)^T in I, suppose the direction from the current image I' to the target position is given by t = (t_x, t_y, t_z)^T, and that between the previous image I'' and the current image the robot performed a step a·t in that direction. Denote by n the remaining number of steps of size a·t separating the current position from the target (so that n = 1/a). The x coordinates of the point in the target, current, and previous images are

x = fX/Z,    x' = f(X + t_x)/(Z + t_z),    x'' = f(X + (1 + a)t_x)/(Z + (1 + a)t_z),

respectively. Eliminating X and Z and dividing by t_z we obtain that

n = (x' - x)(x'' - v_x) / ((x'' - x')(x - v_x)).
The same computation can be applied to the y coordinate of the point. In fact, we can obtain a better recovery of n if we replace the coordinates by
the position of the point along the epipolar line in the three images. (Thus, n is obtained as a cross ratio along this line.) Even though a single corresponding point is sufficient to determine the distance to the target position, we can combine the information obtained from all the points to obtain a more robust estimate of the distance. Notice that this computation will amplify noise in the image when either |x'' - x'| or |x - v_x| is small. Thus, the values obtained for points which undergo a significant change in position between the previous and current images, and whose position in the target image is farther away from the epipole, are more reliable than those obtained for points which undergo only a small change or which appear close to the epipole in the target image.
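The distance recovery can be sketched as follows, using the relation n = (x' - x)(x'' - v_x)/((x'' - x')(x - v_x)) reconstructed above and weighting each point by the magnitude of the denominator, in the spirit of the reliability remark; the names and the particular weighting are our own illustration rather than the authors' implementation.

```python
import numpy as np

def remaining_steps(x_tgt, x_cur, x_prev, v_x):
    """Estimate n, the number of remaining steps of the current size, from the
    x coordinates of corresponding points in the rotation-compensated target,
    current, and previous images, and the epipole coordinate v_x."""
    x_tgt = np.asarray(x_tgt, dtype=float)
    x_cur = np.asarray(x_cur, dtype=float)
    x_prev = np.asarray(x_prev, dtype=float)
    num = (x_cur - x_tgt) * (x_prev - v_x)
    den = (x_prev - x_cur) * (x_tgt - v_x)
    # Points with a large change between the previous and current images, and
    # far from the epipole in the target image, receive more weight.
    w = np.abs(den)
    keep = w > 1e-9
    return float(np.sum(w[keep] * (num[keep] / den[keep])) / np.sum(w[keep]))
```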
4. Experimental results
We have tested our homing algorithm under weak perspective on a thousand initial poses chosen at random. The algorithm converged successfully in all cases. Figure 4.1 shows the effect of uncertainty in the vertex positions measured in the image on the convergence of the algorithm. Figure 4.1(a) shows how the errors in all the components of the pose converge to zero when there is no uncertainty. In Figure 4.1(b) the effect of uncertainty is shown. The uncertainty only affects the final stages of the algorithm, when the error is very small. The algorithm converges more slowly until a solution is found.
Fig. 4.1. The convergence of the components of the pose as the algorithm progresses. The pose is composed of seven components: the three Euclidean coordinates of the viewing direction, two components of the translation, the scale factor, and the image rotation. (a) No noise; (b) Noise level of 1%.
Figure 4.2 shows an example of applying the perspective procedure to simulated data in a noise-free and a noisy environment. As can be seen, in the noise-free example the robot moved along the shortest path to the target while changing its orientation gradually until it matched the target orientation. Notice that, since at the first step the robot could not yet estimate its distance to the target, its first rotation differed from the rest of the rotations.
Fig. 4.2. A simulation of homing under perspective projection. The solid line represents the distance of the robot from the target position, and the dashed line represents the angle separating the current orientation from the target orientation. Left: no noise. Right: Gaussian noise added to the pixel positions at every image.

Finally, we mounted a CCD camera on a robot arm (SCORBOT ER-9, from Eshed Robotec Inc.). The arm was set in a target position and an image was taken (target, see Fig. 4.3(f)). The arm was then set in another position, from which part of the target scene was visible (source, see Fig. 4.3(a)). The correspondences between the source and the target were provided manually. Then, the algorithm described in Section 2. was run. We maintained correspondences between successive frames by tracking the points using a correlation-based tracking algorithm. We took twenty features so that we could afford losing some of the features along the way (because of noise and occlusion) without impairing our ability to recover the epipolar constraints. In computing the epipolar lines at every step of the algorithm we used at least ten corresponding points and applied a least-squares fit. The different steps of the experiment are shown in Fig. 4.3(a)-(h), where (a) is the source image and (b)-(h) are the intermediate steps. The final image is shown in Fig. 4.4(a); note the similarity between it and the target image shown to its right (b). The joint values of the robot in its final position after the homing was completed differed from the target joint values by less than 1° for the five revolute joints, and by less than 1 cm for the linear shift bar.
Fig. 4.3. A run of an experiment with a six degrees of freedom robot: (a) The initial image; (b-f) Intermediate images; Top images: the images seen by the robot. Bottom: the position of the robot taken from a fixed camera.
Fig. 4.4. (a) The final image after homing was completed; (b) The target image. Top images: the images seen by the robot. Bottom: the position of the robot taken from a fixed camera.
5. Conclusions

In this paper we have introduced a novel method for visual homing. Using this method a robot can be sent to desired positions and orientations specified by images taken from these positions. The method requires the pre-storage of the target image only. It then proceeds by comparing the target image to images taken by the robot while it moves, one at a time. Unlike existing approaches, our method determines the path of the robot on-line, and so the starting position of the robot is not constrained. Also, unlike existing methods, which are largely restricted to planar paths, our method can send the robot to arbitrary positions and orientations in 3-D space. Nevertheless, a 3-D model of the environment is not required. Finally, our method requires no memory of previous images taken by the robot. Thus, the method uses minimal information and can also deal with a moving target.
Acknowledgement. Ronen Basri is an incumbent of the Arye Dissentshik Career Development Chair at the Weizmann Institute. Ilan Shimshoni is supported in part by the Koret Foundation. This paper originally appeared in the 1998 IEEE Int. Conf. on Computer Vision [2].
References
1. R. Basri, E. Rivlin, and I. Shimshoni. Visual homing: surfing on the epipoles. Department of Computer Science Technical Report CIS9709, The Technion.
2. R. Basri, E. Rivlin, and I. Shimshoni. Visual homing: surfing on the epipoles. ICCV-98, Forthcoming.
3. R. Basri and E. Rivlin. Localization and homing using combinations of model views. AI, 78:327-354, 1995.
4. G. Dudek and C. Zhang. Vision-based robot localization without explicit object models. IEEE Int. Conf. on Robotics and Automation:76-82, 1996.
5. R. Hartley. In defense of the 8-point algorithm. ICCV-95:1064-1070, 1995.
6. J. Hong, X. Tan, B. Pinette, R. Weiss, and E. M. Riseman. Image-based homing. IEEE Control Systems:38-44, 1992.
7. T. S. Huang and C. H. Lee. Motion and structure from orthographic projections. PAMI, 2(5):536-540, 1989.
8. L. L. Kontsevich. Pairwise comparison technique: a simple solution for depth reconstruction. Journal of Optical Society, 10(6):1129-1135, 1993.
9. E. Kruppa. Zur Ermittlung eines Objektes aus zwei Perspektiven mit innerer Orientierung. Sitz.-Ber. Akad. Wiss., Wien, Math. Naturw. Kl., Abt. IIa., 122:1939-1948, 1913.
10. C. H. Lee and T. S. Huang. Finding point correspondences and determining motion of a rigid object from two weak perspective views. CVGIP, 52:309-327, 1990.
11. H. C. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Nature, 293:133-135, 1981.
12. Y. Matsumoto, I. Masayuki, and H. Inoue. Visual navigation using view-sequenced route representation. IEEE Int. Conf. on Robotics and Automation:83-88, 1996.
13. R. N. Nelson. Visual homing using an associative memory. DARPA Image Understanding Workshop:245-262, 1989.
14. L. S. Shapiro, A. Zisserman, and M. Brady. 3D motion recovery via affine epipolar geometry. IJCV, 16(2):147-182, October 1995.
15. R. Y. Tsai and T. S. Huang. Uniqueness and estimation of three-dimensional motion parameters of rigid objects with curved surfaces. PAMI, 6(1):13-27, 1984.
16. S. Ullman. The Interpretation of Visual Motion. MIT Press, Cambridge, MA, 1979.
17. J. Weng, T. S. Huang, and N. Ahuja. Motion and structure from two perspective views: Algorithms, error analysis, and error estimation. PAMI, 11(5):451-476, 1989.
18. J. Y. Zheng and S. Tsuji. Panoramic representation for route recognition by a mobile robot. IJCV, 9(1):55-76, 1992.
19. D. Zipser. Biologically plausible models of place recognition and goal location. In D. E. Rumelhart et al., Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 2, MIT Press:432-471, 1986.
Role of Active Vision in Optimizing Visual Feedback for Robot Control

Rajeev Sharma

Department of Computer Science & Engineering, The Pennsylvania State University, 317 Pond Laboratory, University Park, PA 16802-6106, USA
Summary. A purposeful change of camera parameters or "active vision" can be used to improve the process of extracting visual information. Thus if a robot visual servo loop incorporates active vision, it can lead to a better performance while increasing the scope of the control tasks. Although significant advances have been made in this direction, much of the potential improvement is still unrealized. This chapter discusses the advantages of using active vision for visual servoing. It reviews some of the past research in active vision relevant to visual servoing, with the aim of improving: (1) the measurement of image parameters, (2) the process of interpreting the image parameters in terms of the corresponding world parameters, and (3) the control of a robot in terms of the visual information extracted.
1. Introduction

The goal of visual feedback in robot manipulation is to help overcome uncertainties in modeling the robot and its environment, thereby increasing the scope of robot applications to include tasks that were not possible without sensor feedback, for example welding. An active control over the imaging process can potentially increase the role visual feedback plays in robot manipulation. Here we look at some of the issues involved in making the camera active, while confining ourselves mainly to visual servoing, that is, the positioning control of a robot from an initial to a final position using visual feedback. This restricts the scope of our discussion to a lower level of control, avoiding issues in active vision for reasoning at the higher, task level. Any vision-guided task carried out with a stationary camera set faces inherent limitations, since there is no control over the imaging process. On the other hand, in order to best process the visual information the camera setting and position have to satisfy some fairly restrictive conditions. For example, the observed feature on an object may go out of the field of view, or out of focus, etc. Hence a camera set-up which can be dynamically changed during the course of a robot manipulation task should enhance the scope of the image processing, the image interpretation, and the control of a robot using the visual feedback. The main incentive for using a controllable imaging system would be an increase in the flexibility and reliability of the visual feedback and hence in the degree of autonomy gained by the robot system involved in the servoing tasks.
Active vision refers to a purposeful change of camera parameters, for example, position, orientation, focus, zoom, to facilitate the processing of visual data. That is, active vision calls for a coupling between image acquisition and visual processing. The motivation for using active vision or, more generally, "active perception" can also be derived from biological vision systems, which are known to be highly active and goal directed [8, 7, 9]. The basic idea is that active control of the geometric parameters of the sensory apparatus benefits a problem in various ways, for example, by transforming it from an ill-posed problem to a well-posed problem, or by constraining it in such a way that an efficient solution becomes feasible. Various forms of active control are employed, for example, tracking in [11], camera fixation in [1, 10], vergence control in [2], selective attention in [16], and control of focusing in [38]. Some of the research on active vision that could potentially benefit visual servoing is reviewed in this chapter. Although various issues in visual servoing have been addressed in the literature (see [21] for a review), it is only recently that active vision has been considered in the context of visual servoing. However, the idea of automatically changing the parameters of a sensor (including a camera) for task-level planning of robot manipulation has been widely investigated in "sensor planning" research [59]. A good introduction to the different mechanisms of visual feedback involved in visual servo control can be found in [49]. In particular, an important distinction made is that of the feedback representation mode, which can be either position-based or image-based. See Figure 1.1 for a schematic overview of the two types of visual servo control loops. The position-based approach basically involves the problem of "visual reconstruction", which is recognized to be difficult because of the inherent non-linearities in the transformation and the uncertainties in the imaging process. On the other hand, image-based approaches require more work in task specification in the feature (or image or camera) space [30, 37, 49, 53, 65, 33]. This distinction between position-based and image-based control will play an important role for introducing active vision in the servo loop. The research in active vision for various 3D "recovery" problems in vision is particularly relevant to position-based visual servoing, as will be shown later. Basically, the goal of the visual feedback is to extract from the raw input array of intensity values a small set of parameters that help in the control. It is hoped that active vision will have an influence over how the visual feedback provided is made more
- relevant (by using attentional mechanisms),
- accurate (by changing resolution, focus, etc.), and
- reliable (to overcome modeling uncertainty, changes, etc.)
for a given visual servoing task. Some of these advantages will be discussed here.
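As a concrete illustration of the image-based mode just described, the following is a minimal sketch of the classical point-feature interaction-matrix (image Jacobian) control law; the gain, the assumed depth estimates, and all names are our own illustration and not the specific controllers of any of the cited works.

```python
import numpy as np

def interaction_matrix(x, y, Z):
    """2x6 interaction matrix of a normalized image point (x, y) at depth Z,
    relating its image velocity to the camera velocity screw (v, omega)."""
    return np.array([[-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x * x), y],
                     [0.0, -1.0 / Z, y / Z, 1.0 + y * y, -x * y, -x]])

def image_based_velocity(features, desired, depths, gain=0.5):
    """Camera velocity screw from the feature error: v = -gain * L^+ (s - s*)."""
    L = np.vstack([interaction_matrix(x, y, Z)
                   for (x, y), Z in zip(features, depths)])
    error = (np.asarray(features, dtype=float)
             - np.asarray(desired, dtype=float)).ravel()
    return -gain * np.linalg.pinv(L) @ error

# Example: four point features, each assumed to be roughly one meter away.
s = [(0.10, 0.05), (-0.12, 0.04), (0.11, -0.06), (-0.09, -0.05)]
s_star = [(0.08, 0.08), (-0.08, 0.08), (0.08, -0.08), (-0.08, -0.08)]
v = image_based_velocity(s, s_star, depths=[1.0] * 4)
```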
Fig. 1.1. An overview of the position-based and the image-based servo control structures: (a) position-based control; (b) image-based control.

The discussion will be organized under the following three groups corresponding to the particular aspect of visual servoing being addressed.
- Feature Extraction: What role does active vision have in improving the measurement of relevant image features extracted from the raw image for visual servoing? This applies particularly to image-based approaches and also to the first stage of position-based visual servoing.
- Feature Interpretation: How can active vision improve the process of relating the measured image features to the corresponding world parameters, with or without models of the observed objects? This applies mainly to position-based visual servoing.
- Visual Control: How can active vision improve the different aspects of visual control (e.g., by avoiding singular configurations)? This applies to both position-based and image-based visual servoing.
The rest of the chapter is organized as follows. The next section gives some relevant definitions and preliminaries for the discussion. Section 3. discusses the potential use of active vision for feature extraction, in particular relating it to research in sensor planning. Section 4. surveys some of the past work in active vision for interpreting relevant 3D parameters that could be used for position-based visual servoing. Section 5. discusses the interplay of active vision with different visual servo control schemas. Section 6. discusses some of the relevant issues for applying active vision to servoing tasks, followed by concluding remarks in Section 7.
2. Active Vision

In order to capture the "best" image of a portion of a scene, various parameters of the imaging apparatus have to be adjusted. Generally this is done manually, but for autonomous operation these parameters should be planned and adjusted actively. The following parameters capture the degrees of freedom involved in setting up an imaging apparatus. Other parameters would be involved in bringing about purposeful changes in the imaging that help
with the visual processing. The camera parameters have been classified under different groupings: intrinsic vs. extrinsic, geometric vs. photometric, etc. We just list the parameters here without defining them formally.
- For a single or monocular camera: focus, zoom, aperture, 3D position, 3D orientation.
- For a binocular (stereo) camera setup, the additional parameters are: baseline, vergence.
One can consider the imaging parameter space that captures the variation of each of the above factors in the imaging apparatus. For example, the ninedimensional camera space referred to by Shafer [50]. For stereo setups and with additional camera egomotion parameters, the dimension of the imaging parameter space will be much higher. The goal of an active vision system would be to bring about changes in the imaging parameters such that it facilitates visual processing. This could be achieved in a variety of different ways--some of which will be discussed in the following sections. In terms of the imaging parameter space thus active vision is a tool to move about in the high dimensional space such that: (i) certain constraints are satisfied (similar to the task-level constraints for sensor planning), and (ii) performance is optimized with respect to a given criteria (e.g., contrast, feature motion, etc.). Some aspects of this "navigation" in the image parameter space are captured in terms of the following active vision primitives that can be called activities. Each of them is a convenient grouping of the motion in the imaging parameter space to achieve a desired change, and has been studied separately under different active vision contexts. -
-
-
-
-
G a z e C o n t r o l . This refers commonly to a change in the orientation of the camera. More generally it can refer to any purposeful change in both the position and orientation of the camera. A c t i v e T r a c k i n g . Moving the camera to maintain some invariance in the image of a certain world object. "Active" refers to the fact that the camera is actually being moved to track to distinguish it from passive tracking in the image. Further distinctions can be made (with analogy to human eye motion) for the tracking to be smooth or saccadic [54]. F i x a t i o n . Changing the gaze such that the optical axes of a stereo camera pair intersect at world point. This is generally achieved by cooperatively moving the corresponding image features in the two cameras to the image center. V e r g e n c e C o n t r o l . Cooperatively moving the fixation point of a stereo pair around in the scene. Focussing. Changing the effective focal length of the imaging setup to bring different scene regions in focus (or blur in case de-focussing is used.)
28 -
Rajeev Sharma E g o m o t i o n C o n t r o l . Changing the motion of the camera (e.g., by a known translation velocity) to help in recovering relative motion parameters.
The above represent only a partial list of active vision primitives (activities), but only some of these have been considered in the past in the context of visual servoing. Tracking has received particular attention in visual servoing because of the nice invariance relation it presents, which allows it to be incorporated in the servoing loop [47]. However, the other activities also have a potential to improve visual servoing, some of which will be considered in the following sections. The imaging systems required for testing active vision algorithms need to provide very accurate and repeatable changes in the camera parameters. In experimental systems, typically two approaches are taken for the control of the positional parameters of the camera. The first is to mount a camera onto a general purpose robot arm to give the eye-in-hand configuration. The second approach is to build special "heads" with typical degrees of freedom for controlling the pan, tilt, and position. Two such motor sets for a stereo pair produce additional degrees of freedom, e.g., vergence and baseline. The optical parameters of the individual cameras are then controlled through motors for changing the focus, aperture, and zoom.
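As a small illustration of the gaze control primitive listed above, applied with a pan-tilt head of the kind just described, the following sketch computes the pan and tilt angles that point the optical axis at a scene point given in the camera frame; the axis conventions and names are assumptions for illustration only.

```python
import numpy as np

def gaze_angles(point_cam):
    """Pan and tilt (radians) that point the optical axis at a scene point
    expressed in the camera frame (x right, y down, z along the optical axis)."""
    x, y, z = [float(v) for v in point_cam]
    pan = np.arctan2(x, z)                 # rotation about the camera's y axis
    tilt = np.arctan2(y, np.hypot(x, z))   # rotation about the (new) x axis;
                                           # the sign depends on the head's convention
    return pan, tilt

# Example: a point one meter ahead, slightly to the right of and below the axis.
print(gaze_angles([0.2, 0.1, 1.0]))
```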
3. Active Vision for Feature Extraction
Unlike other sensors, computer vision involves several steps before the feedback can be applied for control. Assuming that the input is the raw intensity image (that is, not depth as in a laser range sensor), first the features have to be identified, which then yield the feedback parameters. For example, if the relevant feature is a rectangle, first the rectangle has to be identified and then the relevant parameters derived from the image of the rectangle, e.g., the x- and y- image coordinates of a corner point, the length of the diagonal, the area, etc. Note that this involves the segmentation problem (of separating the interesting feature from the rest of the image) as well as some understanding of the perspective projection of the feature (for example, the image projection of the rectangle would in general be a quadrilateral). Active vision can potentially improve the process of feature extraction in several ways. Some of the issues involved in active vision for feature extraction are similar to the considerations involved in sensor planning for feature extraction. See Tarabanis & Tsai [58] for a survey of research involving sensor planning using a camera for the task of feature detection. Basically they survey the various implemented systems that automatically set the parameters of a camera given a description of the task constraints. These constraints are categorized as being either geometric or radiometric [58].
The geometric constraints are:
- Visibility
- Field-of-view
- Depth-of-field
- Magnification or pixel resolution

The radiometric constraints are:
- Illuminability
- Dynamic range of the sensor
- Contrast
The goal of sensor planning is to find the camera parameters that satisfy all of the constraints provided by the task description [58, 22, 14, 32, 34, 40, 62, 3]. An active vision system could also be designed that changes its parameters to simply satisfy the constraints or to optimize some criteria when the constraints have been met. If the description of the task is complete to the extent that the requisite constraints can be derived a priori, then sensor planning can be used to guide active vision. If the description of the task is incomplete and the sensor constraints cannot be derived automatically, then the parameters have to be monitored by some other means, e.g., a light detector for illumination control, tracking of a feature for gaze control, etc. Active vision is concerned with parameter adjustment during the execution of a task. The considerations for optimizing sensor parameters are all still applicable, but the focus here will be on how these parameters can be varied or actively changed. However, the results in sensor planning can be extremely useful for active vision; for example, the same partitions of parameter values used in sensor planning can be used for switching from one region of a continuous parameter space to another during the execution of a task, to dynamically achieve some feature extraction criteria. Of course, ideally the parameter variation should be continuous so that the camera parameters can be actively changed, for example, its orientation when mounted on a robot end effector. Many of the issues in active vision for feature extraction are the subject of future research, since the chief concern of researchers in the past has been planning rather than reactively changing the camera parameters to improve feature extraction. When combining the effects of several different criteria for feature extraction, one approach would be to use a weighted sum of the different normalized measures, with the weights representing the importance of each aspect of feature extraction; a small sketch of this combination appears at the end of this section. An example of such a combination strategy was used in [29] for feature selection for visual servoing (see also [55]). A similar approach can be taken to guide an active vision system for aiding feature extraction under several optimizing criteria. When a particular set of image features is being used for visual servoing, extraction of such features is closely related to tracking. In this section only 2D image tracking is relevant, that is, tracking without a 3D model of the
world or a particular set-up where the world feature is indeed moving in a plane perpendicular to the axis of the camera. In order to extract features in this manner the camera would have to be actively translating. This has some application in seam tracking or in monitoring a conveyor belt with the camera mounted perpendicular to the moving features or actuator [18, 36]. For example, in [47] the differential displacement of an image feature set is used to drive a 2D active tracking algorithm. The tracked objects are assumed to be at known depths, thus no "interpretation" was involved. [42] gives another example of tracking in 2D using optic flow. Tracking of features moving with general motion is considered in the next two sections, since it involves interpretation and important aspects of control. Closely related to the visibility problem is the aspect of switching the "attention" of the visual mechanism when parts of the scene contain unknown objects. Much work has been done to address issues related to "where to look next?" (e.g., [48, 44]). The criteria used for selecting a point for fixating in the 3D world are task-dependent and hence beyond the scope of the discussion here, but a recent survey of selective fixation control is given in [1]. Much work needs to be done in active vision with respect to feature extraction. The research in sensor planning considers some of the possible issues involved in feature extraction, but more consideration has to go into utilizing this research in building actual active vision systems that vary the camera parameters to aid feature extraction.
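The weighted-sum combination strategy mentioned earlier in this section (cf. [29, 55]) can be sketched as follows; the particular criteria, weights, and names are placeholders for illustration, not those used in the cited works.

```python
import numpy as np

def combined_feature_score(measures, weights):
    """Combine normalized feature-quality measures (each in [0, 1], e.g.,
    contrast, distance from the image border, tracking residual) into a
    single score using task-dependent importance weights."""
    m = np.clip(np.asarray(measures, dtype=float), 0.0, 1.0)
    w = np.asarray(weights, dtype=float)
    return float(w @ m / w.sum())

# Example: rank candidate features and keep the best one for servoing.
candidates = {"corner_A": [0.9, 0.4, 0.8], "corner_B": [0.6, 0.9, 0.7]}
weights = [0.5, 0.2, 0.3]   # assumed relative importance of each criterion
best = max(candidates, key=lambda k: combined_feature_score(candidates[k], weights))
```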
4. Active Vision for Feature Interpretation
In position-based visual servoing, the servoing is done on the relative position of the robot (end-effector) and an object in the world. Thus the feedback parameter is the relative position (and orientation) in the 3D world which has to be somehow extracted from the 2D image (or image set, or image sequence). Essentially this involves what is termed "reconstruction" in computer vision, and is recognized to be a hard problem. In many cases it can be shown to be mathematically intractable. In the case when it is theoretically tractable the vision algorithm runs into practical problems, for example, finding correspondences while using stereo algorithms. Active vision can play a very important role in improving the solution to the process of interpreting the image features (parameters) in terms of the corresponding world parameters. The issues that are addressed in bringing about such improvement are:
- mathematical tractability, i.e., is the interpretation problem well-posed mathematically?
- stability, i.e., is the solution stable to small errors in the measurements?
- computational efficiency, i.e., can the interpretation be done efficiently (in real time)?
Here we consider how active vision helps in extracting 3D world information from the image features using different techniques. We do not consider here the added predictability that would be gained by knowing that the parameter to be measured is also under the control of the system. This leads to integration of the active vision into the visual servo loop using good filtering models. We discuss this in the next section. Much of the active vision research, in fact, is concerned with how the individual activity influences the image interpretation, e.g., tracking on egomotion estimation, vergence control on relative depth estimation, etc. We look at some of these inter-relations here that are relevant to visual servoing. A formal treatment of how active vision helps in some of the interpretation problems was given in [7]. It was contended that the goal of active vision is to manipulate the constraints underlying an observed phenomenon in order to improve the quality of the perceptual results. The improvement was in terms of mathematical tractability (an ill-posed problem becoming well-posed), computational efficiency (a non-linear solution becoming linear), stability, etc. Several well-known problems in computer vision were then posed in terms of the active vision paradigm (using different forms of activity) to illustrate the improvement purely from mathematical considerations. The vision tasks considered were shape from shading, shape from contour, shape from texture, structure from motion, and computation of optical flow. Since this seminal paper in active vision, progress has been quite rapid in addressing both the theoretical aspects of active vision algorithms as well as their practical implementation. See [12] for a collection of recent results. The literature on active vision for visual interpretation is quite vast. A good portion of this research is relevant to visual servoing; we review only a very small subset. An important parameter for position-based control would be the depth of a scene point for a given (camera) reference frame. However, the depth information can be extracted from a multiplicity of active vision cues, for example, stereo, vergence, focussing, etc. An important issue is how to integrate these multiple cues to derive the most reliable depth measure for servoing. [4] considers the issues of integrating multiple depth cues from active vision. The active vision depth cues considered were stereo disparity, vergence, focus, aperture, and calibration. A related issue is how reliable a measured position parameter derived from a particular active vision algorithm is. This reliability is related to the tolerance of the algorithm to various kinds of modeling errors in the imaging system and process. A systematic consideration of the relative performance of different active vision cues for the depth of a scene point is presented in [24]. The performance is measured in terms of the variance and mean of the depth derived from each cue, in the presence of both random errors (from non-ideal image formation and processing) and systematic errors (from imperfect calibration of the different parts of the imaging system). The particular active vision
cues compared are parallel stereo, vergence, and focus. The variation of the performance with factors like baseline and range of the scene point was also demonstrated. When more than one active vision process is involved, they must cooperate to achieve the desired servoing task. Examples where these cooperation issues are addressed are [38, 20]. In [38] focus and stereo algorithms cooperate in interpreting the image, while cooperative gaze control for binocular vision is studied in [20]. [57] demonstrates the use of fixation to simplify the process of interpretation of motion parameters, using what was termed "direct robot vision", which is meant to imply that the active vision technique was able to avoid the correspondence problem as well as the problem of computing optical flow. Tracking of 3D objects requires either an explicit or implicit interpretation process. A good body of literature exists on tracking of 3D objects using different models of motion and sensing errors, for example methods based on Kalman filtering, e.g., [43]. Another approach for tracking 3D objects was presented in [63]. It aimed at avoiding the limitations of long-range trajectory prediction methods by taking advantage of the spatiotemporal density of visual data using both a feature-based and a flow-based technique. A mathematical model of tracking was presented in [6]. Various advanced models have been developed for tracking, for example, snakes [35], dynamic contours using B-splines [23, 15], or superquadrics [60]. Tracking is an important research area by itself and is also very important for the development of real-time active vision systems [12]. [5] gives an example of real-time motion tracking using a stereo camera set-up and spatio-temporal filters. [28] aims at a more "qualitative" technique for target detection and tracking. [42] presents a tracking scheme using optical flow. Tracking of features in the world has been demonstrated to improve different interpretation problems, for example in reducing the complexity of the structure-from-motion problem [7, 11]. As another example, [31] shows how tracking helps with the estimation of 3-D relative motion parameters. [43] shows the use of feature tracking to derive the structure of a scene from a motion sequence. Because the relative motion of a camera and the relevant world object induces a vector flow field called optical flow, flow-based active vision techniques could also be relevant for visual servoing. In motion interpretation, the goal is to recover relative motion parameters from the optical flow. However, knowledge of the camera motion (or egomotion) can be factored into the interpretation process, resulting in active motion analysis techniques. Further, a purposeful change in the egomotion can also be used to benefit the motion interpretation process. Examples that use active motion analysis based on optical flow are presented in [51, 31, 64].
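The integration of multiple depth cues discussed in this section can be illustrated with a simple inverse-variance fusion; the cited systems [4, 24] use considerably richer error models, so this is only a minimal sketch with assumed numbers.

```python
import numpy as np

def fuse_depth_estimates(depths, variances):
    """Fuse depth estimates from different cues (e.g., stereo disparity,
    vergence, focus) by inverse-variance weighting; returns the fused depth
    and the variance of the fused estimate."""
    z = np.asarray(depths, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    return float(np.sum(w * z) / np.sum(w)), float(1.0 / np.sum(w))

# Example: stereo gives 1.02 m with low variance, focus gives 1.10 m with a
# higher variance; the fused estimate leans toward the stereo measurement.
print(fuse_depth_estimates([1.02, 1.10], [0.0004, 0.0100]))
```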
5. Active Vision for Visual Control
In the previous two sections active vision was reviewed in the context of feature extraction and interpretation without considering the interplay of active vision and control. The only mention of control was made with a reference to the two alternative control structures, i.e., image-based visual servo and position-based visual servo. The active vision issues were discussed essentially independently of the servo level control. In this section an attempt is made to study the relationship between active vision and robot control. Active vision changes the feedback signal in the visual servoing control loop; hence it is important to study the relationship between an active vision strategy and the particular control scheme being used. At the same time active vision implies control of a mechanized device and hence itself involves control issues. The latter is particularly important when the active vision is closely coupled with the servoing task, for example, tracking using an eye-in-hand configuration. Motion of the camera could be the result of the motion of the controlled robot in an eye-in-hand configuration. The question from the active vision perspective would be how to take advantage of the motion of the camera. Of course, in the eye-in-hand configuration the disadvantage is that the relative change in the camera position is coupled with the actual task. Hence the active vision part has to be considered very closely with the control task. One can imagine the interplay between a motion made just to help with the visual processing and a motion that is part of the robot task. However, the key factor that distinguishes this form of active vision (also called controlled active vision [47]) is that the world parameters to be visually estimated for servoing (e.g., the relative position of the robot end effector and an object in the scene) are themselves under the control of the system. This is in contrast with the active vision used for estimating completely unknown world parameters (which should perhaps be termed exploratory active vision, e.g., [39, 67], to make the distinction clear). Apart from the eye-in-hand configuration, control considerations are also important in designing active vision algorithms. Some of these considerations are also discussed here. [47, 46] presents a formulation of the object tracking problem as one that combines control with vision, calling it Controlled Active Vision. The term is meant to emphasize the fact that active vision is considered within a control schema. A set of discrete displacements of image features is fed to several different controllers: the PI controller, the pole assignment controller, and a discrete steady-state Kalman filter. Then the performance of the tracking is compared for each controller as a function of the type of vision algorithm used. It is claimed that the controlled active vision formulation leads to better tracking and servoing, resulting in better active vision algorithms (that use tracking), for example, structure from motion, etc. (see Section 4.). The general insight is that combining noisy vision measurements with an appropriate control law leads to better visual servoing.
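To give a concrete flavor of feeding image-feature displacements to a simple controller, here is a minimal discrete PI loop that drives pan/tilt rates from the image error of a tracked feature; the gains, sample time, and mapping to actuator commands are assumptions for illustration and not the controllers evaluated in [47, 46].

```python
import numpy as np

class PanTiltPI:
    """Discrete PI controller that moves the camera so that a tracked feature
    stays at the image center (errors in normalized image coordinates)."""
    def __init__(self, kp=0.8, ki=0.1, dt=0.033):
        self.kp, self.ki, self.dt = kp, ki, dt
        self.integral = np.zeros(2)

    def update(self, feature_xy):
        error = np.asarray(feature_xy, dtype=float)   # desired position is (0, 0)
        self.integral += error * self.dt
        return -(self.kp * error + self.ki * self.integral)  # (pan_rate, tilt_rate)

controller = PanTiltPI()
command = controller.update([0.05, -0.02])   # feature is slightly off-center
```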
Similar emphasis on system modeling for visual servoing, both for modeling the motion parameter estimation and for developing control laws for tracking (using regulatory theory), is suggested in [41]. This work also compares the performance of position-based and image-based control structures for tracking using a stereo camera set-up. The simulation results presented give some intuitive conclusions, for example, that under poor camera calibration image-based tracking is better than position-based tracking. This formal control framework would seem quite appropriate for incorporating active vision in a visual servo loop. [27] models vision as part of the control loop in terms of "interaction screws". This allows some of the desired results from automatic control theory (robustness, stability analysis, etc.) to be applied. Careful modeling is also used in [13] for tracking using differential motion detected from vision, applied to determining the accuracy and repeatability of a robot. Issues for incorporating vision in a dynamic feedback loop were also addressed in [30, 37, 65]; a resolved rate control schema was used in [30], a self-tuning controller was used in [37], while a model reference adaptive control schema was used in [65]. Recursive state estimation using Kalman filtering is used for developing vision-based control for high speed autonomous vehicles in [25, 26]. The resulting predictive ability is used to direct processing within a given region. This represents a good example of using gaze control for visual servoing. The experimental results presented are a remarkable demonstration of the temporal "continuity" of an image sequence exploited for image interpretation and control. [17] presents a control schema and implementation for a binocular active vision system through specifications of gains in parallel feedback loops. They use the term "directed vision" for the resulting active vision system and use biological analogies from the mammalian oculomotor control system [19]. An important issue for the control of a robot manipulator is the avoidance of singular configurations [66]. Consideration of singularities is also important in active vision and visual servoing. For example, in visual servoing it is important to be able to observe a differential motion of the image features being used for servoing. If the image features do not change despite changes in robot configuration, then the system may be in a singular configuration. The position of the camera influences how the feature changes can be observed with a given change in the robot motion. In [52] a scalar measure called motion perceptibility is defined that can be used in evaluating a hand/eye set-up with respect to the ease of achieving vision-based control. Similarly, [45] proposed a vector resolvability measure for increasing the effective visual tracking region of an eye-in-hand system. These measures extend the notion of "manipulability" of a robotic mechanism in [66] to incorporate the effect of visual features and are useful as a means of avoiding "singular" camera configurations. It can be used to guide the motion of an active camera system
using the criteria of motion perceptibility, which allows for an improvement in the visual servo control. Figure 5.1 shows the variation of the motion perceptibility measure with the camera position for a 3-DOF PUMA-type robot. Clearly, one would like to design an active vision system that avoids the configurations corresponding to low values of the motion perceptibility measure (near singularities) [56]. Thus the motion perceptibility provides a quantitative estimate of the "goodness" of feature motion observation with respect to visual servoing.

Fig. 5.1. The variation of the motion perceptibility measure with different camera positions for a 3-DOF robot arm.

Design of a control schema should consider the effects of incorporating active vision. For a camera with a fixed relationship to the controlled object (e.g., the eye-in-hand robot configuration) and for a tracking task, incorporating active vision systematically in the control loop is feasible, as demonstrated in [47, 46]. A similar study is needed for incorporating active vision in a general visual servo task and with various types of active vision techniques, for example, focus, zoom, vergence, and motion control of the camera set-up.
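As a sketch of the kind of measure discussed above, and by analogy with Yoshikawa's manipulability [66], one can take the product of the singular values of the image Jacobian as a perceptibility score; the exact definition used in [52] may differ in detail, so this is an illustration only.

```python
import numpy as np

def motion_perceptibility(image_jacobian):
    """Product of the singular values of the image Jacobian (the matrix mapping
    robot velocities to image feature velocities); the value drops toward zero
    near configurations where feature motion becomes hard to observe."""
    s = np.linalg.svd(np.asarray(image_jacobian, dtype=float), compute_uv=False)
    return float(np.prod(s))

# Example: two point features (four image coordinates) and three joints; the
# third joint produces almost no image motion, so the measure is nearly zero
# and warns of a near-singular camera/robot configuration.
J = np.array([[ 1.0, 0.0, 0.001],
              [ 0.0, 1.0, 0.000],
              [ 0.8, 0.1, 0.002],
              [-0.1, 0.9, 0.001]])
print(motion_perceptibility(J))
```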
6. Discussion

A great deal of research in active vision is concerned with aspects of high level perception and planning tasks, while visual servoing is concerned only with the low level feedback control of a robot manipulator. An attempt was made to confine the discussion of active vision to what was most relevant to visual servoing, but this was somewhat difficult, given that so little work has been
done in active vision for visual servoing. By keeping some of the references to active vision for higher levels, it is hoped that the discussion will provide some framework for understanding the application of active vision to the servo level as well. There appears to be a close relationship between sensor planning and the aspects of active vision dealing with feature extraction. Some of the similarities and differences were discussed in Section 3. However, much work needs to be done to relate the two in practical terms. As sketched in the previous sections, there are many facets of active vision in terms of how it affects visual servoing. However, in a practical system the same camera (or camera set) would be used to execute several of the active vision activities during the execution of a particular task. Managing some of the conflicting activities would be necessary to maximize the effect of active vision. This is somewhat related to the management or organization of multiple behaviors (or agents) in autonomous systems research. We did not consider it here since it would be related to task-level considerations, while here we have confined ourselves mainly to low level visual servoing. The exploratory aspects of active vision are not as significant here, since we can exploit more constraints provided by the robot and workspace model. We mainly deal here with task monitoring and visual feedback for control, over which we have some predictive ability; thus the state space approach [47, 41] would seem most relevant for incorporating active vision into the visual servoing. On the other hand, the surveyed active vision papers show us how the interpretation of the image stream could benefit from the knowledge of the controlled motion. More study is needed to relate much of the active vision research to the tighter conditions defined for visual servoing. Another issue to consider is the time delays introduced by the active vision (i.e., any motor activity not directly related to the task at hand) and the additional processing delays that could also be involved by introducing the active vision. Calibration is another difficult issue [61]. Calibrating an active vision system could introduce additional difficulty which needs to be considered carefully. In general, the more degrees of freedom of the active camera system, the more the difficulty in controlling it to achieve the desired effect of aiding the visual servoing task. In the extreme case the control of the active vision system might become even more complicated than the original robot control problem, offsetting some of the advantages gained by making the camera active. On the other hand this provides an incentive for building more specialized robot heads with a high dynamic performance that will help in utilizing some of the potential advantages of active vision for visual servoing.
7. Conclusion

This chapter briefly reviewed some of the work on active vision that relates to visual servoing. There seems to be a wide scope for improving different
aspects of visual servoing with active vision. These potential improvements are discussed under the three categories of
- feature extraction,
- feature interpretation, and
- visual control.
Further distinctions were made based upon whether the image-based or position-based feedback mode is applied. An attempt was made to relate the work on sensor planning with the problems in active vision. The survey of active vision for feature interpretation was confined to extracting depth (range) using active vision, since that is most relevant to position-based visual servoing. Much of the potential of using active vision for visual servoing is still unrealized. Some of the open issues that stand in the way of realizing the full potential of active vision were also sketched.
References 1. A. L. Abbott. A survey of selective fixation control for machine vision. IEEE Control Systems, 12(4):25-31, 1992. 2. A. L. Abbott and N. Ahuja. Surface reconstruction by dynamic integration of focus, camera vergence, and stereo. In Proc. IEEE International Conference on Computer Vision, pages 532-543, 1988. 3. S. Abrams, P. K. Allen, and K. A. Tarabanis. Dynamic sensor planning. In Proc. IEEE International Conference on Robotics and Automation, pages 605610, 1993. 4. N. Ahuja and A. L. Abbott. Active stereo: Integrating disparity, vergence, focus, aperture, and calibration for surface estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(10):1007-1029, 1993. 5. P. K. Allen, A. Timcenko, B. Yoshimi, and P. Michelman. Trajectory filtering and prediction for automatic tracking and grasping of a moving object. In Proc. IEEE International Conference on Robotics and Automation, pages 1850-1857, 1992. 6. J. Aloimonos and D. Tsakiris. On the visual mathematics of tracking. Image and Vision Computing, 9:235-251, 1991. 7. J. Aloimonos, I. Weiss, and A. Bandyopadhyay. Active vision. International Journal of Computer Vision, 1:333-356, 1988. 8. R. Bajcsy. Active perception. Proceedings of the IEEE, 78:996-1005, 1988. 9. D. H. Ballard and C. M. Brown. Principles of animate vision. CVGIP: Image Understanding, 56:3-21, 1992. 10. D. H. Ballard and A. Ozcandarli. Eye fixation and early vision: Kinetic depth. In Proc. IEEE International Conference on Computer Vision, pages 524-531, 1988. 11. A. Bandopadhay and D. H. Ballard. Egomotion perception using visual tracking. Computational Intelligence, 7:39-47, 1991. 12. A. Blake and A. Yuille. Active Vision. MIT Press, Cambridge, MA, 1992. 13. M. E. Bowman and A. K. Forrest. Visual detection of differential movement: Applications to robotics. Robotiea, 6:7-12, 1988. 14. A. Cameron and H. Durrant-Whyte. A bayesian approach to optimal sensor placement. International Journal of Robotics Research~ 9(5):70-88, 1990.
15. R. Cipolla and A. Blake. The dynamic analysis of apparent contours. In Proc. IBEE International Conference on Computer Vision, pages 616-623, 1990. 16. J. J. Clark and N. J. Ferrier. Modal control of an attentive vision system. In Proc. IEEE International Conference on Computer Vision, pages 514-519, 1988. 17. J. J. Clark and N. J. Ferrier. Attentive visual servoing. In A. Blake and A. Yuille, editors, Active Vision, pages 137-154. MIT Press, Cambridge, MA, 1992. 18. W. F. Clocksin, J. S. E. Bromley, P. G. Davey, A. R. Vidler, and C. G. Morgan. An implementation of model-based visual feedback for robot arc welding of thin sheet steel. International Journal of Robotics Research, 4(1):13-26, 1985. 19. H. Collewijn and E. Tamminga. Human smooth and saccadic eye movements during voluntary pursuit of different target motions on different backgrounds. Journal of Physiology, 351:217-250, 1984. 20. D. J. Coombs and C. M. Brown. Cooperative gaze holding in binocular vision. IEEE Control Systems, 11(4):24-33, 1991. 21. P. I. Corke. Visual control of robot manipulators--a review. In K. Hashimoto, editor, Visual Servoing, pages 1-32. World Scientific, 1993. 22. C. G. Cowan. Automatic sensor placement from vision task requirements. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10:407-416, 1988. 23. R. Curwen and A. Blake. Dynamic contours: Real-time active splines. In A. Blake and A. Yuille, editors, Active Vision, pages 39-58. MIT Press, Cambridge, MA, 1992. 24. S. Das and N. Ahuja. A comparative study of stereo, vergence, and focus as depth cues for active vision. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 194-199, 1993. 25. E. D. Dickmanns and W. Graefe. Applications of dynamic monocular machine vision. Machine Vision and Applications, 1:241-261, 1988. 26. E. D. Dickmanns, B. Mysliwetz, and T. Christains. An integrated spatiotemporal approach to automatic visual guidance of autonomous vehicles. IEEE Transactions on Systems, Man, and Cybernetics, 20:1273-1284, 1990. 27. B. Espiau, F. Chanmette, and P. Rives. A new approach to visual servoing in robotics. IEEE Transactions on Robotics and Automation, 8:313-326, 1992. 28. B. B. et al. Qualitative target motion detection and tracking. In Proc. DARPA Image Understanding Workshop, pages 370-398, 1989. 29. J. T. Feddema, C. S. G. Lee, and O. R. Mitchell. Weighted selection of image features for resolved rate visual feedback control. IEEE Transactions on Robotics and Automation, 7:31-47, 1991. 30. J. T. Feddema and O. R. Mitchell. Vision-guided servoing with feature-based trajectory generation. IEEE Transactions on Robotics and Automation, 5:691700, 1989. 31. C. Fermuller and Y. Aloimonos. Tracking facilitates 3-d motion estimation. Biological Cybernetics, 67:259-268, 1992. 32. G. D. Hager. Task Directed Sensor Fusion and Planning. Kluwer Academic Publishers, 1990. 33. K. Hashimoto, T. Kimoto, T. Ebine, and H. Kimura. Manipulator control with image-based visual servo. In Proc. IBEE International Conference on Robotics and Automation, pages 2267-2271, 1991. 34. S. A. Hutchinson and A. C. Kak. Planning sensing strategies in a robot work cell with multi-sensor capabilities. IEEE Journal of Robotics and Automation, 5(6):765-782, 1989. 35. M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. International Journal of Computer Vision, 1(4):321-331, 1987.
Role of Active Vision in Optimizing Visual Feedback for Robot Control
39
36. P. K. Khosla, C. P. Neuman, and F. B. Prinz. An algorithm for seam tracking applications. International Journal of Robotics Research, 4(1):27-41, 1985. 37. A. J. Koivo and N. Houshangi. Real-time vision feedback for servoing robotic manipulator with self-tuning controller. IEEE Transactions on Systems, Man, and Cybernetics, 21:134-142, 1991. 38. E. P. Krotkov. Active Computer Vision by Cooperative Focus and Stereo. Springer-Verlag, Berlin, 1989. 39. K. N. Kutulakos and C. R. Dyer. Recovering shape by purposive viewpoint adjustment. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 16-22, 1992. 40. C. Laugier, A. Ijel, and J. Troccaz. Combining vision-based information and partial geometric models in automatic grasping. In Proc. IEEE International Conference on Robotics and Automation, pages 676-682, 1990. 41. M. Lei and B. K. Ghosh. Visually-guided robotic motion tracking. In Proc. Thirtieth Annual Allerton Conference on Communication, Control, and Computing, pages 712-721, 1992. 42. R. C. Luo and R. E. M. Jr. A modified optical flow approach for robotic tracking and acquisition. Journal of Robotic Systems, 6(5):489-508, 1989. 43. L. Matthies, T. Kanade, and R. Szeliski. Kalman filter-based algorithms for estimating depth from image sequences. International Journal of Computer Vision, 3:209-236, 1989. 44. P. McLeod, J. Driver, Z. Dienes, and J. Crisp. Filtering by movements in visual search. Jrn. Experimental Psychology: Human Perception and Performance, 17(1):55-64, 1991. 45. B. Nelson and P. K. Khosla. Strategies for increasing the tracking region of an eye-in-hand system by singularity and joint limit avoidance. International Journal of Robotics Research, 14(3):255-264, 1995. 46. N. P. Papanikolopoulos and P. K. Khosla. Adaptive robot visual tracking: Theory and experiments. IEEE Transactions on Automatic Control, 38(3):429445, 1993. 47. N. P. Papanikolopoulos, P. K. Khosla, and T. Kanade. Visual tracking of a moving target by a camera mounted on a robot: A combination of vision and control. IEEE Transactions on Robotics and Automation, 9(1):14-35, 1993. 48. R.D. Rimey and C. M. Brown. Controlling eye movements with hidden markov models. International Journal of Computer Vision, 7(1):47-65, 1991. 49. A. C. Sanderson and L. E. Weiss. Adaptive visual servo control of robots. In A. Pugh, editor, Robot Vision. IFS Publications, Bedford, UK, 1983. 50. S. A. Shafer. Automation and calibration for robot vision systems. Technical Report CMU-CS-88-147, Carnegie Mellon University, 1988. 51. R. Sharma and Y. Aloimonos. Early detection of independent motion from active control of normal image flow patterns. IEEE Transactions on Systems, Man, and Cybernetics, 26(1):42-52, February 1996. 52. R. Sharma and S. Hutchinson. Motion perceptibility and its application to active vision-based servo control. IEEE Transactions on Robotics and Automation, 13(4):607-617, August 1997. 53. R. Sharma and H. Sutanto. A framework for robot motion planning with sensor constraints. IEEE Transactions on Robotics and Automation, 13(1):6173, February 1997. 54. N. Srinivasa and R. Sharma. Execution of saccades for active vision using a neurocontroller. IEEE Control Systems, 17(2):18-29, April 1997. 55. H. Sutanto and R. Sharma. Global performance evaluation of image features for visual servo control. Journal of Robotic Systems, 13(4):243-258, April 1996.
40
Rajeev Sharma
56. H. Sutanto, R. Sharma, and V. K. Varma. The role of exploratory movement in visual servoing without calibration. Robotics and Autonomous Systems, 1998. (to appear). 57. M. A. Taalebinezhaad. Direct robot vision by fixation. In Proc. I E E E International Conference on Robotics and Automation, pages 626-631, 1991. 58. K. Tarabanis and R. Y. Tsai. Sensor planning for robotic vision: A review. In O. Khatib, J. J. Craig, and T. Lozano-Pdrez, editors, Robotics Review 2. MIT Press, Cambridge, MA, 1992. 59. K. A. Tarabanis, P. K. Allen, and R. Y. Tsai. A survey of sensor planning in computer vision. I E E E Transactions on Robotics and Automation, 11:86-104, 1995. 60. D. Terzopoulos and D. Metaxas. Dynamic 3d models with local and global deformations: Deformable superquadrics. I E E E Transactions on Pattern Analysis and Machine Intelligence, 13(7):703-714, 1991. 61. R. Y. Tsai. Synopsis of recent progress on camera calibration for 3d machine vision. In Robotics Review 1. MIT Press, Cambridge, MA, 1989. 62. R. Y. Tsai and K. Tarabanis. Occlusion-free sensor placement planning. In H. Freeman, editor, Machine Vision, pages 301-339. Academic Press, 1990. 63. G. Verghese, K. L. Gale, and C. R. Dyer. Real-time motion tracking of threedimensional objects. In Proc. I E E E International Conference on Robotics and Automation, pages 1998-2003, 1990. 64. D. Vernon and M. Tistarelli. Using camera motion to estimate range for robotic parts manipulation. I E E E Transactions on Robotics and Automation, 6(5):509521, 1990. 65. L. E. Weiss, A. C. Sanderson, and C. P. Neuman. Dynamic sensor-based control of robots with visual feedback. I E E E Journal of Robotics and Automation, 3:404-417, 1987. 66. T. Yoshikawa. Analysis and control of robot manipulators with redundancy. In Robotics Research: The First Int. Symposium, pages 735-747. MIT Press, 1983. 67. J. Y. Zheng, Q. Chen, and A. Tsuji. Active camera guided manipulation. In Proc. I E E E International Conference on Robotics and Automation, pages 632638, 1991.
An Alternative Approach for Image-Plane Control of Robots

Michael Seelinger, Steven B. Skaar, and Matthew Robinson

Department of Aerospace and Mechanical Engineering, Fitzpatrick Hall of Engineering, University of Notre Dame, Notre Dame, Indiana 46556-5637 USA
Summary. The article distinguishes between camera-space manipulation and visual servoing; it identifies application categories which particularly favor the former. Using a 2-dimensional, 2-degree-of-freedom example, a point of convergence between the two methods is identified. This is based upon an extension of visual servoing wherewith a Kalman filter is used to estimate the ongoing visual-error signal from discrete-time, noisy, delayed and possibly intermittent visual input combined with continuous joint-rotation input. A working, nonholonomic, vision-based control system which applies a similar estimation strategy is introduced in order to illustrate and make plausible this extension, and to clarify the comparison. The comparison is discussed in terms of the advantages and disadvantages accrued as the two methods are restored to their usual forms from the respective extensions or modifications required to produce convergence.
1. Introduction

Considerable research is directed toward the use of calibration methods for achieving visual guidance of robots [36, 53, 54, 58]. Such methods entail two separate calibration events which are equally critical for maneuver success: 1. calibration of the cameras relative to a fixed, physical "world" coordinate system [6, 31, 52, 58]; and 2. calibration of the robot kinematics relative to the same physical coordinate system [7, 16, 33]. Thus the imaging and the manipulation steps are separated. After workpiece location, as a separate activity, the internal coordinates of the robot are driven to a terminus based upon an inverse-kinematics calculation. As early as the mid 1970s (e.g. Tani et al. [50]) it was noted that the imaging and manipulation steps need not be separated - that organisms provide an example of hand-eye coordination which is based upon pursuing maneuvers in terms of "sensor-space" success, without reference to any absolute coordinate system. Thus the idea of "vision feedback" was born. In this approach, errors as perceived in the sensor reference frame itself - disparities between the actual and desired sensor-frame locations of key features - are used in a control law in order to drive the key features toward a zero-image-error state. Some knowledge of the partial sensitivity of the image response to increments in each joint rotation is required, but terminal success tends to be robust to errors in the needed matrix Jacobians.
In Corke [14], Sanderson and Weiss [39], and Rives et al. [37], diverse forms of implementation of this idea are put forth. The name changes from "vision feedback" to "visual servoing" (VS), but the strategy - implicit in both names - of driving image-plane errors toward zero using inputs determined through matrix Jacobians continues to be pursued vigorously [1, 10, 11, 13, 15, 17, 18, 25, 28, 34, 35, 43, 51]. Each of the various forms of VS has advantages or disadvantages [14, 27] depending upon the application which is being addressed. The present article argues that there are certain applications for which a different approach to the pursuit of image objectives is more effective than VS. This alternative, introduced in 1986 [44] and subsequently developed [4, 12, 21, 22, 23, 24, 29, 32, 40, 41, 42, 45, 46, 47, 48, 49], is effective in situations where, during maneuver execution, the participating, uncalibrated cameras can be counted upon to remain stationary between the time of visual access to the workpiece (prior to any blockage) and the time of maneuver culmination, and where the workpiece also remains stationary during this interval. The advantage of this alternative - called "camera-space manipulation" (CSM) - is especially pronounced where maneuver closure itself can cause significant obscuration, to the cameras, of the visual error prior to maneuver culmination. Also, certain practical realities associated with artificial vision - coarseness of the discretized image, as well as the computational intensity of image-feature identification and location - tend to favor CSM, with its application of new visual data to input-output estimation rather than to simple feedback. The comparison between VS and CSM, and the rationale for favoring the latter, is explored by identifying a point of convergence of the two methods. This convergence occurs in a two-dimensional context where visual-error-feedback estimation is introduced into a visual-servoing control strategy, on the one hand, and where the six estimation parameters of CSM are reduced to a particular two, on the other. By identifying this point of convergence, the two approaches are compared based on the advantages and disadvantages accrued as the respective methods are returned from the limits of convergence to their usual forms. The contrast is drawn further as one extends the comparison from the two-dimensional case to three dimensions.
2. Accessibility of visual error near the terminus
Note that VS generally requires ongoing access to, and sampling of, the visual error - the difference between an image-plane target and the positioned body [27, 57] - which may not be available due to various kinds of obscuration during the critical period of closure. Fig. 2.1 illustrates such situations of image-plane-error obscuration with tasks that have been conducted successfully, with precision and robustness, using CSM. In each case visual access either to the workpiece/target, or to the positioned object, or to both, becomes obscured
prior to closure. Digital video of the three tasks being performed using CSM can be viewed on the WWW.1 In addition to the prohibitive matter of image-plane-error obscuration by the robot/load, there are other reasons to consider CSM for the above classes of problems, reasons which are outlined below.

1 http://www.nd.edu/NDInfo/Research/-sskaar/Home.html

Fig. 2.1. (a) Wheel loading, (b) Pallet stacking, (c) Etching.
3. Image-plane control: feedback vs. estimation
For many control systems it is effective to "servo", or to adjust the input to the "plant" based upon error which is detected in an ongoing, "real-time" way. This has proven robust and effective in countless applications. However, there may be instances where it is more effective not to use incoming samples to "servo" but rather to identify and refine the input-output relationship of the plant and, essentially, using current best estimates of this relationship, to drive the plant forward open loop to its target state(s). We argue that there are cases, such as those discussed above, where the visual control of robots is better effected by this latter use of incoming information. In particular, five criteria are put forth below which indicate a preference for the system-i.d./open-loop-control approach:
1. Unreliability of the ongoing error signal characterizes the visual feedback.
2. A zeroeth-order (algebraic) input-output relationship characterizes the plant which, in the local operating region, is essentially static and noise-free.
3. The visual-information signal on the basis of which feedback (for VS) - or system i.d. (for CSM) - would be based is noisy, with large-rms noise.
4. The input-output relationship is highly observable using the breadth and frequency of samples which are typically available during operation.
5. Substantial time delay may be associated with the maneuver-specific information used, in the one case, for feedback, and in the other case, for estimation.
The two-dimensional, two-degree-of-freedom case of Fig. 3.1(a) obscures some of the issues pertaining to the more general 3D, rigid-body-positioning
case; however, it does distinguish between VS and CSM. Also, certain of the issues which arise in the broader problem can be considered in terms of extension from the present, much simpler, problem.

Fig. 3.1. (a) 2-degree-of-freedom system for 2-dimensional point positioning, (b) System physical configuration at t = t_4, (c) Camera-space sequence of appearances of point A.

The objective of the system of Fig. 3.1(a) is to bring to physical coincidence the positioned point, A, with the target point, B. It is assumed that both of these points lie in the plane of the robot's workspace, and that the two rotations, θ_1 and θ_2, are available to locate point A over a broad extent within this plane, including the point B. It is assumed further that a single camera is directed at the system, and that image-processing software is able to identify point A and point B, and to locate them in the two-dimensional image plane or "camera space". As indicated in Fig. 3.1(b), corresponding with each intermediate position of point A will be a value of the "joint-coordinate vector", {θ} = [θ_1 θ_2]^T. In the figure, four particular values of {θ} are identified; these correspond with four instants: t_1, t_2, t_3, and the "current" instant, t_4. Both VS and CSM are concerned with the positions of points A and B in the image plane of the participating camera. The coordinates {X_c} = [x_c y_c]^T in the image of point A at the four instants, t_1 - t_4, are indicated in Fig. 3.1(c) as {X_c(t_1)}, {X_c(t_2)}, {X_c(t_3)}, and, for the current instant, {X_c(t_4)}. Both CSM and VS take as their basic objective satisfaction of the maneuver requirement in the reference frame(s) of the visual sensor(s), that is, in the context of the present problem, the bringing to coincidence of positioned point A with point B in the camera space - which is indicated, or implied, in Fig. 3.1(c) by the computer-monitor image. These methods differ, however, in the means by which this objective is reached.
4. Visual Servoing

If t = t_4 is the current instant, the joint rotations would be determined based on the current camera-space error, i.e. the vector {e} = {X_B - X_c(t_4)}, as illustrated in Fig. 4.1(a). The commanded joint velocities might be based upon {e} according to a control law given by [14, 18, 27]
Fig. 4.1. (a) Camera-space error vector at the current instant, t = t_4, (b) Nominal forward kinematics of the manipulator.

\{\dot{\theta}_r(t)\} = K_p \, [J(\theta(t))]^{-1} \, \{e(t)\} \qquad (4.1)
where K_p would be a proportional control gain and the elements of {θ̇_r} are the desired, or reference, values of the vector of joint-coordinate velocities {θ̇} = d{θ}/dt which are passed in real time as separate inputs to the joint-level servomechanisms. (Various forms of VS use the outputs from the camera-space control law differently [14, 27].) The Jacobian [J] is some established, generally {θ}-dependent matrix which relates increments in the joint-rotation elements to corresponding increments in the camera-space location of point A. Typically [J(θ)] is found by using a combination of some approximate knowledge of the camera position and of the camera/frame-grabber mapping of physical position into camera-space position, combined with the nominal forward kinematic model of the robot, in this instance as this model pertains to point A [8, 27]:
X_A = L_1 \cos(\theta_1) + L_2 \cos(\theta_1 + \theta_2) = f_x(\theta_1, \theta_2) \qquad (4.2)

Y_A = L_1 \sin(\theta_1) + L_2 \sin(\theta_1 + \theta_2) = f_y(\theta_1, \theta_2) \qquad (4.3)
where L_1 and L_2 are as defined in Fig. 4.1(b). The required Jacobian of Eq. (4.1) might be determined using a model of the mapping between physical space, {x} = [x y z]^T, and camera space, {X_c} = [x_c y_c]^T, e.g. {X_c} = {g(x, y, z; D)}, where, for a given camera/lens/frame-grabber combination, the vector of parameters {D} could include as few as six elements as required to capture the position/orientation of the camera in relation to the manipulator base. With these parameters determined, the Jacobian can be found according to

[J(\theta)] = \left[\frac{\partial g}{\partial x}\right] \left[\frac{\partial f}{\partial \theta}\right] \qquad (4.4)

where, for this 2-dimensional robot, {f(θ)} = [f_x f_y 0]^T.
This general strategy, while it does depend upon a prior characterization of the matrix [J(θ)], and is therefore in some sense dependent upon calibration, has been shown to converge smoothly to zero error even if [J(θ)] is rather badly mischaracterized [2, 14].
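For concreteness, the following Python sketch (not part of the original article) carries out one step of the control law of Eq. (4.1) for the planar arm. The link lengths, the scaled-orthographic stand-in for the camera model {g}, and the gain K_p are all invented for illustration; the Jacobian is formed numerically from the composition of {g} with the nominal kinematics {f}, as described above.

import numpy as np

# Nominal forward kinematics of the planar 2-DOF arm, Eqs. (4.2)-(4.3).
L1, L2 = 0.4, 0.3  # link lengths [m] (assumed for illustration)

def f(theta):
    t1, t2 = theta
    return np.array([L1 * np.cos(t1) + L2 * np.cos(t1 + t2),
                     L1 * np.sin(t1) + L2 * np.sin(t1 + t2),
                     0.0])

# Assumed stand-in for the camera model {g(x, y, z; D)}: a fixed scaled-orthographic
# mapping from physical space to pixels (the pinhole model of the text would be used the same way).
G = np.array([[800.0, 0.0, 0.0],
              [0.0, 800.0, 0.0]])   # pixels per meter (assumed)
c0 = np.array([320.0, 240.0])       # image offset [pixels] (assumed)

def g(x):
    return G @ x + c0

def jacobian(theta, eps=1e-6):
    """Approximate [J(theta)] = [dg/dx][df/dtheta] of Eq. (4.4) by central differences."""
    J = np.zeros((2, 2))
    for k in range(2):
        d = np.zeros(2)
        d[k] = eps
        J[:, k] = (g(f(theta + d)) - g(f(theta - d))) / (2.0 * eps)
    return J

def vs_rate_command(theta, Xc_target, Kp=0.5):
    """Visual-servoing rate command of Eq. (4.1): theta_dot_r = Kp * J^{-1} {e}."""
    e = Xc_target - g(f(theta))                 # camera-space error {e}
    return Kp * np.linalg.solve(jacobian(theta), e)

# Example: one commanded joint-rate vector toward an image target for point B.
print(vs_rate_command(np.array([0.3, 0.8]), np.array([500.0, 400.0])))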
5. Camera-space manipulation
Like VS, CSM seeks to realize maneuver objectives directly in the reference frame(s) of the visual sensor(s). Also like VS, CSM exploits some prior knowledge of the manipulator kinematics, specifically the model {f(θ)}, as well as the model {g(x, y, z; D)} whereby points in physical space are, for the vision system in use, projected from physical space into camera space. However, rather than computing a reference angular velocity for the joint rotations, {θ̇_r}, at the current instant only, CSM computes an entire joint-coordinate trajectory plan {θ_r(t)} through to the maneuver terminus (assuming that the task has such a single terminus; for certain tasks, where a particular, continuous, seamless, tightly controlled path is required, such as the etching task of Fig. 2.1(c), the present strategy is extended to a smooth and continuous realization of a whole sequence of target points corresponding to the full required path). This trajectory {θ_r(t)} is passed along to the joint-level controller of the robot system, although it may, until close to the end of the motion, be replaced - due to the availability of newly processed visual information at various junctures t along the way - by an "upgraded" reference path for the "time to go" of the current maneuver. Of course it is important, at those points in time where transitions in the reference joint trajectories occur due to new-information-based reference-trajectory "upgrades", that these transitions be adequately smooth. To understand this strategy, consider the current instant, t_4, of the 2DOF manipulator as shown in Fig. 3.1(b). At this instant the arm is moving toward its goal and a visual sample is acquired; but because image analysis will require some time, the arm continues to proceed toward the maneuver goal using information acquired, and a joint-level trajectory plan {θ_r(t)} calculated, prior to t_4. According to Fig. 3.1, the information acquired prior to t_4 includes {X_c(t_1)}, {X_c(t_2)}, {X_c(t_3)}, along with the corresponding, simultaneously measured (via joint encoders) {θ(t_1)}, {θ(t_2)}, {θ(t_3)}. If the sample acquired at t_4 becomes processed and available prior to the end of the maneuver, t_f, then this new information will be used to update {θ_r(t)} beginning at some future time t', where t_4 < t' < t_f. For now, assume that t_3 is sufficiently separated from t_4 that, some time between these two instants, call it t', a {θ_r(t)} which takes into account the information acquired through t_3 has been computed and is in effect at t = t_4. If the robotic system as delivered from the factory, given a single, reasonably smooth {θ_r(t)}, is stable, and if the updates of {θ_r(t)} are infrequent,
smooth-transitioning from the previous {θ_r(t)}, and involve only slight adjustments in the path-to-go, then stability of the CSM system described above is taken for granted. This has been demonstrated in the uniformly stable performance of myriad physical tests [12] (such as those presented in video under the URL given above) performed with commercial robots using this strategy. The basis on which the trajectory plans {θ_r(t)} are made and updated is estimation of the input/output relationship where the input is {θ} and the output is {X_c}, using the "locally" (in joint- and camera-space, as well as time) most relevant information available. Clearly this relationship, assuming static cameras, a static manipulator base, and a holonomic robot [49], will be algebraic rather than merely differential. Being model-based, the estimation procedure should make use of the geometric structure of the robot and camera models. The strategy is based upon a straightforward combination of the nominal forward kinematics, for our example robot given by [X_A Y_A Z_A]^T = {f(θ)} = [f_x(θ) f_y(θ) 0]^T, and, for {g(x, y, z; D)}, the pinhole, or perspective, camera model [26, 58]. Substituting the one into the other gives a relationship directly between {θ} and {X_c}; for the 2D robot:

\{X_c\} = \{g(f_x(\theta), f_y(\theta), 0; D)\} \equiv \{h(\theta; D)\} \qquad (5.1)
In principle, the six-element vector {D} could be determined by minimizing over all {D}

F(D) = \sum_i \{h(\theta(t_i); D) - X_c(t_i)\}^T \, [W_i] \, \{h(\theta(t_i); D) - X_c(t_i)\} \qquad (5.2)
where each 2 × 2, symmetric matrix [W_i] is adjusted to give higher "weight" to those samples acquired nearest to the terminus of interest. However, problems of numerical stability associated with the combination of the pinhole camera model and the typically confined span or extent of the applied samples have necessitated, in practice, an alteration in this strategy: the introduction of an intermediate form of estimation model as described in Gonzalez et al. [24]. Furthermore, it has been observed that good estimates require some reasonable initial diversity of samples, such that the samples acquired at t_1, t_2 and t_3 of Fig. 3.1, for example, would need to be supplemented by perhaps 10 samples acquired earlier in a "preplanned trajectory" - using motion not directly pertinent to the current maneuver - over a larger breadth of {θ} and {X_c} [46, 47]. This latter requirement is perhaps comparable to the need, using VS, to preestablish a [J(θ)] matrix; with CSM, once the supplementary samples are acquired - in a given day, for example - they can continue to be used to minimize F(D) for several tasks; in these tasks, the information from the early samples is supplemented by heavily weighted samples pertaining to the current, actual maneuver and approach. Due to the problem of numerical stability, an alternative form of estimation model to {h(θ; D)} for the "camera-space kinematics", called here
{h†(θ; C)}, has been used [46, 48] in practice. Estimation of the six elements of {C} of {h†(θ; C)}, however, is much the same as the above in that it is based upon weighted local samples, by minimizing over all {C}:

F^{\dagger}(C) = \sum_i \{h^{\dagger}(\theta(t_i); C) - x_c^{\dagger}(t_i)\}^T \, [W_i] \, \{h^{\dagger}(\theta(t_i); C) - x_c^{\dagger}(t_i)\} \qquad (5.3)
(Both {h†(θ; C)} and {h(θ; D)} are nonlinear estimation models - {h†(θ; C)} nonlinear in both {θ} and {C}, and {h(θ; D)} nonlinear in both {θ} and {D}; however, the simpler form of {h†} results in better numerical performance. Furthermore, the camera-space samples are replaced in Eq. (5.3) with modified or "flattened" samples {x_c†(t_i)} in such a way as to produce the identical effect at t_f as that which would be realized with direct use of the pinhole model h [48].) The advantage of applying a general model directly to the camera-space/joint-input pairs is apparent: leaving as estimation parameters in the model those six quantities {C} which account for camera position/orientation relative to the manipulator, and giving preferred weighting to local samples, results in the ability to compensate virtually completely for most manipulator-kinematic or camera/physical-space model errors. So long as the distance-to-go, in joint and camera space, is reasonably short, the open-loop trajectories which are calculated in order to reach the target are very accurate; for the present example the required terminal vector of reference joint rotations {θ_r(t_f)} - the joint-rotation vector that would result in physical coincidence of points A and B - would be found, using the most recent estimates of {C}, by solving:
\{X_B\} = \{h^{\dagger}(\theta; C)\} \qquad (5.4)
Then a suitable reference {θ_r(t)} would be computed which would connect an earlier, actual robot pose with the computed terminal {θ_r(t_f)}. Referring to the updates of the entire reference path, as discussed above, it first should be noted that, in practice, each new update, based on more-recent, more locally applicable information (e.g. {θ(t_4)}, {X_c(t_4)}), has little effect on {θ_r(t_f)}. There are two reasons for this: 1. the new sample tends to be very consistent with the previous estimates of {h†(θ; C)}, producing a small residual; and 2. due to the goodness, locally, of the model {h†(θ; C)}, the estimates are more nearly optimal if the weight applied to the new sample is relatively low compared with the weight given to the aggregate of earlier samples. The other point that should be made is that, for the present maneuver, a newly computed reference {θ_r(t)}, t < t_f (according to new {C} estimates for a new {θ_r(t_f)} using Eq. (5.4) above), merely needs to connect continuously - typically including continuity in the first two time derivatives of {θ} - with the previously determined {θ_r(t)} at some near-future instant
t', t_4 < t' < t_f, of transition between the two. There are many strategies [9] which may be used to produce such smooth-transitioning reference joint-rotation trajectories.
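The two computational pieces of CSM just described can be sketched in Python as follows, under stated assumptions: a weighted least-squares fit of the six parameters of the camera-space kinematics (written here in the six-parameter form that appears later as Eq. (10.1)), followed by a Newton solution of Eq. (5.4) for the terminal joint rotations. The link lengths, the scalar weights standing in for the 2 × 2 matrices [W_i], and the simple Gauss-Newton solver are illustrative choices, not the article's implementation.

import numpy as np

L1, L2 = 0.4, 0.3  # link lengths of the example arm (assumed)

def f(theta):
    """Nominal forward kinematics of Eqs. (4.2)-(4.3), with z = 0 for the planar arm."""
    t1, t2 = theta
    return np.array([L1 * np.cos(t1) + L2 * np.cos(t1 + t2),
                     L1 * np.sin(t1) + L2 * np.sin(t1 + t2),
                     0.0])

def h(theta, C):
    """Six-parameter camera-space kinematics model (the form written out in Eq. (10.1))."""
    C1, C2, C3, C4, C5, C6 = C
    fx, fy, fz = f(theta)
    return np.array([
        (C1**2 + C2**2 - C3**2 - C4**2) * fx + 2*(C2*C3 + C1*C4) * fy
        + 2*(C2*C4 - C1*C3) * fz + C5,
        2*(C2*C3 - C1*C4) * fx + (C1**2 - C2**2 + C3**2 - C4**2) * fy
        + 2*(C3*C4 + C1*C2) * fz + C6])

def num_jac(fun, x, n_out, eps=1e-6):
    """Central-difference Jacobian of fun at x (n_out rows, len(x) columns)."""
    J = np.zeros((n_out, len(x)))
    for k in range(len(x)):
        d = np.zeros(len(x))
        d[k] = eps
        J[:, k] = (fun(x + d) - fun(x - d)) / (2.0 * eps)
    return J

def estimate_C(thetas, Xcs, weights, C0, iters=25):
    """Gauss-Newton minimization of the weighted criterion of Eq. (5.3); the weights are
    taken here as scalars w_i (diagonal [W_i]) purely for simplicity, and a reasonable,
    non-degenerate initial guess C0 is assumed."""
    C = np.array(C0, dtype=float)
    for _ in range(iters):
        res, jac = [], []
        for th, Xc, w in zip(thetas, Xcs, weights):
            res.append(np.sqrt(w) * (h(th, C) - Xc))
            jac.append(np.sqrt(w) * num_jac(lambda c: h(th, c), C, 2))
        step = np.linalg.lstsq(np.vstack(jac), -np.concatenate(res), rcond=None)[0]
        C = C + step
    return C

def solve_terminus(C, XB, theta0, iters=50):
    """Newton solution of Eq. (5.4), {X_B} = {h(theta; C)}, for the terminal rotations."""
    th = np.array(theta0, dtype=float)
    for _ in range(iters):
        th = th + np.linalg.solve(num_jac(lambda t: h(t, C), th, 2), XB - h(th, C))
    return th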
6. Conditions which favor CSM
The five conditions which tend to favor CSM, introduced above, are now considered one by one.

1. Unreliability of an ongoing error signal characterizes the visual feedback. Some tasks require the end effector to obscure the target object from the view of the camera. The brake-plate-mounting experiment of Fig. 2.1(a) is one example. Even if the nature of the task were such that the end effector did not obscure the target object from view, there are a multitude of factors present in any industrial setting which could cause some type of obscuration. For instance, the position of the robot could affect the lighting in such a way as to cause the target object to be unrecognizable. Perhaps other equipment in the industrial setting could interfere with the line of sight of the camera. With the severance of an error signal {e}, it becomes impossible to solve Eq. (4.1) directly, thus interrupting the closed-loop VS. CSM, as an open-loop method, is robust with respect to the unreliability of an ongoing error signal.

2. A zeroeth-order (algebraic) relationship between the "input" {θ} and the "output" {X_c}. The ability to operate "open loop" to reach camera-space goals is dependent upon the particular order - zero - of the relationship {h†} between the input {θ} and the output {X_c}. Part of the reason for this lies with the indirect and imperfect control of {θ}. With a computed reference trajectory {θ_r(t)} determined, the friction-dominated and otherwise complicated, high-order, joint-level control system attempts to track {θ_r(t)}, producing an actual {θ(t)} in a virtually unpredictable way until the system comes to rest. If the response of {X_c} were dependent upon the path of {θ} actually tracked then, as is the case for nonholonomic robots such as the wheelchair of Fig. 6.1, there would be significant problems associated with open-loop pursuit of camera-space goals. To illustrate this, consider a holonomic, 2DOF system, similar to that of Fig. 3.1(a), and a nonholonomic one - a wheelchair - for which the relationship between the input wheel rotations and the output position and orientation of the chair is at best differential. Fig. 6.2 depicts a large error in the terminal position of a wheelchair due to errors in tracking the joint rotations. If an error in tracking the reference joint rotations occurs, returning to the reference path will cause a holonomic robot to terminate at the desired pose, whereas, in general, a nonholonomic robot will terminate at an incorrect pose under the same circumstance. Hence VS might have a comparative advantage for nonholonomic robots; but for holonomic robots, where the tip-location outcome is independent of the actual trajectory of {θ}, CSM is appropriate.
Fig. 6.1. Nonholonomic system.
Fig. 6.2. Consequences of joint-level tracking errors on maneuver outcome: nonholonomic vs. holonomic systems.

3. The signal on the basis of which feedback (for VS) - or system i.d. (for CSM) - would be based is noisy, with large-amplitude noise. It is well known that visual information contains significant levels of noise [1, 26]. This noise is caused by many factors including high sensitivity of the vision system to spurious lighting, changes in the working environment, and errors incurred in the digitization process due to relatively low-resolution vision sensors. For the closed-loop VS system, successful maneuver termination arises by driving to zero the error signal generated by a constant stream of visual images. As mentioned previously, each of these images is subject to a significant amount of noise. Because there is no filtering, the positioning precision of the system is subject to the full amount of error found in any given visual image. CSM makes use of all the available data that have been generated in a given trajectory or set of trajectories. Use of all of this historical data has the beneficial effect of averaging out the zero-mean noise present in any single frame. Experimental results have shown that sub-pixel positioning is achieved in the presence of much larger rms errors in the camera-space location of a given detected cue. For instance, recent results of a drilling test show that the system reliably positions to within 1 mm, which corresponds to about 1/3 of a pixel in the reference frame of the cameras used [42]. In a similar set of experiments with a much smaller robot and better camera resolution, 1/3 mm positioning precision was achieved. This 1/3 mm also corresponds to about
1/3 pixel in the reference frame of the camera as it was configured for the experiment. By contrast, in our experiments, which may be typical in this regard, the rms of the camera-space location of a single feature within a single frame is approximately 1.5 pixels, almost five times the value of the actual terminal positioning error achieved with CSM.

4. The input-output relationship is nevertheless highly observable with the breadth and frequency of samples which are typically available during operation. As discussed above, practical observability of all six elements of the vector {C} of the model {h†(θ; C)} can require the use of a preplanned trajectory. This is where simultaneous observations of the image-plane location of the manipulable, or end-effector, cues (see e.g. Fig. 2.1(c)), along with the robot joint rotations which are acquired during the current maneuver, are supplemented by a small number of similar, previously acquired samples which span a broader range of manipulator-joint space and camera space. Evidence of observability, given such supplementary information, is found in the consistent determination of the same vector of parameters {C} regardless of the particular schedule of observations or the particular motion which comprises the preplanned trajectory. The input-output relationship {h†(θ; C)} is updated with heavily weighted local samples as the robot moves toward its terminal position. This updating information corrects for virtually all model errors in the local working region of camera- and joint-space, enabling the system to achieve high levels of precision in both three-dimensional position and orientation. Tests have shown that when updating with local information, where that information ends with 100 mm left to go, 1 mm precision (less than a third of a pixel) is achieved consistently. The level of positioning precision diminishes with growth in the "distance to go", such that, if new visual information ends with 1000 mm remaining, the error reaches 3 or 4 mm.

5. Substantial time delay may be associated with the information used, with VS, for feedback, and, with CSM, for estimation. The matter of time delay for image analysis will depend upon a variety of factors including the complexity of the imaging information and the availability of computer power. Bandwidth limitations on closed-loop stability could present a real problem for VS; this would be especially true if, for some reason, special marks or cues, such as those used in Fig. 2.1, are not permitted on the manipulated end-member or on the target object. As Hutchinson et al. point out, closed-loop systems can become unstable as delays are introduced into the system [27]. This instability has been manifested in the form of oscillations in the terminal position, as reported by Feddema and Mitchell [18] and Papanikolopoulos et al. [35]. Papanikolopoulos et al. note that the endpoint oscillations can have a detrimental effect on the precision of the terminal position, as oscillations can cause image blurring [35].
The time delay and intermittence associated with the processing of incoming information do not have the same detrimental effect with CSM.
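The rms-reduction argument above can be illustrated with a small synthetic experiment: averaging roughly twenty zero-mean samples of 1.5-pixel rms (the single-frame figure quoted above) leaves a residual rms of about a third of a pixel, the same order as the terminal precision reported. The sample count and the equal weighting used below are assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
sigma_single = 1.5     # rms camera-space noise of one sample [pixels] (figure from the text)
n_samples = 20         # number of (roughly equally weighted) local samples (assumed)
trials = 10000

# Each trial averages n_samples noisy observations of the same camera-space location.
noise = rng.normal(0.0, sigma_single, size=(trials, n_samples))
averaged = noise.mean(axis=1)

print("single-sample rms:", noise[:, 0].std())   # about 1.5 pixels
print("averaged rms     :", averaged.std())      # about 1.5/sqrt(20), i.e. roughly 1/3 pixel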
7. A point of convergence

For all of the differences between CSM and VS, it is interesting to note that there is, in the context of the 2DOF problem presented above, a feasible coming together of the two methods. (This convergence does not extend to three-dimensional, rigid-body positioning because of further differences between the methods required to accommodate 3D.) The point of convergence would follow from the use of estimation to utilize information - for example, referring to Fig. 3.1, {X_c(t_1)}, {X_c(t_2)}, {X_c(t_3)}, along with the continuously measured {θ(t)} - in order to produce uninterrupted, ongoing, time-varying estimates of {X_c(t)} which, together with an assumed stationary camera-space target {X_B}, can produce a nearly continuous {e(t)} = {X_B - X_c(t)} as required by the controller of Eq. (4.1). This estimation might be carried out similarly to a scheme which has been developed by Baumgartner et al. [5], reviewed below, for use with the nonholonomic, 2DOF system pictured in Fig. 6.1; the scheme is based upon "continuously available" samples of wheel rotations - similar to the continuously available joint rotations of the VS extension under discussion - together with sporadic image samples. The ability to make use of such estimates for VS would obviate the problems discussed above, including the problem of image-analysis time delay. Note that the feedback control system of Baumgartner et al. [5], which is not an example of CSM but which does use discrete-time, delayed image information to estimate continuously the controlled state (the position of a moving robot), proceeds with stability despite the large time delay associated with image processing. Also, just as the system of Baumgartner et al. [5] works with the erratic vision-system provision of video information to achieve feedback control of the wheelchair, so too the present extension of VS would be insensitive to interruption of direct visual access to the image-plane error. If such time-varying estimates of {e(t)}, which might be produced using joint-rotation histories in a variation of the extended Kalman filter such as that given in Baumgartner et al. [5], are used through to the maneuver terminus according to the visual-servoing feedback control law of Eq. (4.1), the terminal result could be identical - as discussed below - to that of the application of CSM. While researchers have suggested that estimation be used with VS for purposes of tracking the sensor-space position of the target [20, 38, 55, 56], there has been no suggestion of using an approach, as outlined below, to estimate the ongoing sensor-space position(s) of the manipulated point(s) of interest.
Fig. 8.1. Kinematics of wheelchair system.

8. Generating continuous state-estimate feedback for the nonholonomic system

A stochastic model for the state equations for the response of point A of the system of Fig. 8.1 to rotation of the two wheels, θ_1 and θ_2, can be written:
\frac{dX_A}{dt} = \frac{R\,[u_1 + u_2]}{2}\cos\phi + w_1(t) = \gamma_1(X, u(t)) + w_1(t) \qquad (8.1)

\frac{dY_A}{dt} = \frac{R\,[u_1 + u_2]}{2}\sin\phi + w_2(t) = \gamma_2(X, u(t)) + w_2(t) \qquad (8.2)

\frac{d\phi}{dt} = \frac{R\,[u_1 - u_2]}{b} + w_3(t) = \gamma_3(X, u(t)) + w_3(t) \qquad (8.3)
where the two control inputs u_1 and u_2 are defined according to u_1 = dθ_1/dt and u_2 = dθ_2/dt, and where {X} = [X_A Y_A φ]^T and {u} = [u_1 u_2]^T. Moreover, {γ(X, u(t))} = [γ_1 γ_2 γ_3]^T. Also, the elements of {w} = [w_1 w_2 w_3]^T are assumed to be uncorrelated, Gaussian, random "white" noise with zero mean and given, constant covariance [Q] = E({w}{w}^T), with E representing the expectation process [5]. (Note that, although this is a two-degree-of-freedom system, three components of position/orientation, X_A, Y_A, φ, can be controlled to reach separately specified terminal values given an appropriate trajectory of the inputs u_1 = dθ_1/dt and u_2 = dθ_2/dt. This contrasts with the example system of Fig. 3.1, with its outcome dependence on the terminal values only of θ_1 and θ_2 - and the consequent dependence of the angular position of the end member upon the terminal θ_1 and θ_2. The difference in controllability is due to the fact that the system of Fig. 3.1 is holonomic whereas that of Fig. 8.1 is nonholonomic. Note further that, while u_1 = dθ_1/dt and u_2 = dθ_2/dt cannot be controlled directly, they can be assessed "continuously", in real time, using wheel encoders; hence Eqs. (8.1-8.3) are a good basis for continuous estimation of the state {X}.) Video observations for the wheelchair system of visual "cues" which are placed at known locations throughout the environment are available only
erratically due to the unpredictability of lighting, the unpredictability of visual access of the cues to the onboard cameras, and finite, variable image-processing time.

Fig. 8.2. Reference trajectory.

Image-plane-location observations, {y}, of the cues are related to the state according to the calibrated model:
\{y(t_i)\} = \{p(X(t_i))\} + \{v(t_i)\} \qquad (8.4)
where t_i is the instant at which a video observation is acquired, as distinct from the later instant t_i^* at which the observation becomes available following suitable digital processing of the image. Also, let [R] = E({v}{v}^T) be the covariance of the uncorrelated, Gaussian, observation noise with zero mean. Note that in general the diagonal elements of [R] are on the order of 1 pixel-squared and hence rather large because of the coarsely quantized pixel domain. It is the purpose of Eqs. (8.1-8.4) to create continuous-feedback estimates {X̂(t)} in order to track the reference pose trajectory {X_r(t)} as indicated in Fig. 8.2. Tracking is accomplished using a control law similar to that of the VS system of Eq. (4.1), i.e. a control law where the reference angular-velocity elements {Ω_r} for the two wheel servomechanisms are determined based upon the instantaneous-error estimate {e(t)} = {X_r(t) - X̂(t)}. As with any control law Ω_r = Ω(e), the control law actually used [5] requires reasonably smoothly time-varying error estimates despite the erratic, noisy nature of the video observations. Since {X_r(t)} is given to be smoothly and continuously varying with time, this is achieved over the interval t_{i-1}^* < t < t_i^* by simply integrating numerically from {X̂(t_{i-1}^* | t_{i-1})} according to:

\{\hat{X}(t \mid t_{i-1})\} = \{\hat{X}(t_{i-1}^* \mid t_{i-1})\} + \int_{t_{i-1}^*}^{t} \{\gamma(\hat{X}(\tau \mid t_{i-1}), u(\tau))\}\, d\tau \qquad (8.5)
where {X̂(t_{i-1}^* | t_{i-1})} is the estimate of {X} at the time t_{i-1}^* at which the last processed sample became available, given all of the vision information up to and including the sample {y(t_{i-1})}.
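A minimal sketch of the dead-reckoning propagation of Eq. (8.5), assuming the differential-drive form of Eqs. (8.1)-(8.3) as reconstructed above; the wheel radius R, track width b, rates, and step size are invented for illustration, and only the deterministic part of the model is integrated, with simple Euler steps driven by encoder-derived wheel rates.

import numpy as np

R, b = 0.15, 0.56   # wheel radius and wheel separation [m] (assumed for illustration)

def gamma(X, u):
    """Deterministic part of Eqs. (8.1)-(8.3): state X = [X_A, Y_A, phi], wheel rates u = [u1, u2]."""
    phi = X[2]
    v = R * (u[0] + u[1]) / 2.0
    return np.array([v * np.cos(phi), v * np.sin(phi), R * (u[0] - u[1]) / b])

def propagate(X_hat, wheel_rates, dt):
    """Eq. (8.5): integrate the encoder-driven model forward between the instants
    at which processed vision samples become available."""
    X = np.array(X_hat, dtype=float)
    for u in wheel_rates:          # one encoder reading per control period dt
        X = X + gamma(X, u) * dt   # simple Euler step
    return X

# Example: coast forward for 2 s at 50 Hz with a gentle turn.
rates = [np.array([2.0, 1.8])] * 100
print(propagate([0.0, 0.0, 0.0], rates, 0.02))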
Note that if, due to image obscuration or long processing times, {y(t_{i-1})} is the last available video sample, no adjustment of the general scheme is required in order to apply the control law Ω_r = Ω(e) through to the terminus {X_r(t_f)} without new or ongoing video sampling. If, however, at a later instant t_i, prior to the end of the trajectory event, a subsequent {y(t_i)} does become available, it is possible to update the earlier estimate {X̂(t_i | t_{i-1})}; this is accomplished, using the extended Kalman Filter (KF) [19, 30], according to:

\{\hat{X}(t_i \mid t_i)\} = \{\hat{X}(t_i \mid t_{i-1})\} + [K(t_i)]\,\{y(t_i) - p(\hat{X}(t_i \mid t_{i-1}))\} \qquad (8.6)
where the "Kalman gain" [K(ti)] is figured optimally [19, 30]. The importance of the new information, {y(ti)), and the residual that it creates, {r_(ti)) = {y(ti) -p(:K(ti [ti-1)), depends upon the magnitude of K, the Kalman gain, which in turn depends upon the error variances of [R] and [Q] which are associated with measurement noise and process noise respectively. Generally, according to the KF, the larger [Q] is relative to [R] (i.e. higher confidence in the new samples, lower confidence in the "plant" or process model) the larger K will be, and therefore the more emphasis will be placed upon the new sample; furthermore, as more time elapses between ti-1 and ti, K grows, thus placing more emphasis upon the new sample. Now because of finite processing time, it becomes necessary to transition the estimate {X__(tilti)) to {X__(t;Iti)}, from which point Eq. (8.5) can be applied, and so on through to maneuver termination. This transitioning occurs [5] by integrating in real time, from t = ti, a state-transition matrix [3], [~(t~, ti)], such that
where [Φ(t_i^*, t_i)] is computed by integrating numerically in real time, along with Eq. (8.5), from the initial condition [Φ(t_i, t_i)] = [I], according to

\frac{d}{dt}[\Phi(t, t_i)] = [G(\hat{X}(t \mid t_{i-1}), u(t))]\,[\Phi(t, t_i)]
where [G(X, u)] = [∂γ(X, u)/∂X]. For a digital-video presentation via the WWW of the control system of Fig. 8.1 in real-time operation using the above estimation/control scheme, see footnote 2. This video demonstrates that the above estimation scheme is practical where discrete-time vision information with unpredictable availability and significant image-processing delay is used to estimate the system state
2 http://www.nd.edu/NDInfo/Research/-sskaar/jdscircuit.mov
continuously, and where, based upon this state estimate, a control law determines real-time reference input for angular-velocity servomechanisms. The suggestion here is that it is just this capability which is required to make VS practical (provided the camera, manipulator base and workpiece remain stationary over the course of the maneuver). The above development has not actually been applied to the VS problem; however, its implementation for VS is outlined below in a way that parallels the above discussion.
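The delayed-measurement correction of Eqs. (8.6)-(8.7) can be sketched as follows. This is not the implementation of [5]: the covariance bookkeeping of the full extended Kalman filter is omitted, the observation Jacobian is formed numerically, the state-error covariance P is assumed to be supplied by that filter, and the state-transition matrix Phi is assumed to have been integrated separately according to the relation above.

import numpy as np

def num_jac(fun, x, n_out, eps=1e-6):
    """Central-difference Jacobian of fun at x."""
    J = np.zeros((n_out, len(x)))
    for k in range(len(x)):
        d = np.zeros(len(x))
        d[k] = eps
        J[:, k] = (fun(x + d) - fun(x - d)) / (2.0 * eps)
    return J

def delayed_vision_update(X_pred_ti, X_pred_now, P, p_fun, y, R_cov, Phi):
    """One delayed vision update in the spirit of Eqs. (8.6)-(8.7).
    X_pred_ti  -- estimate at the (past) sampling instant t_i, given data through t_{i-1}
    X_pred_now -- estimate at the current instant t_i*, given data through t_{i-1}
    P          -- state-error covariance at t_i (assumed supplied by the filter)
    p_fun      -- calibrated observation model {p(X)} of Eq. (8.4)
    y, R_cov   -- the processed image sample and its noise covariance [R]
    Phi        -- state-transition matrix Phi(t_i*, t_i), integrated per the relation above."""
    H = num_jac(p_fun, X_pred_ti, len(y))                  # observation Jacobian
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R_cov)       # Kalman gain [K(t_i)]
    r = y - p_fun(X_pred_ti)                               # residual {r(t_i)}
    X_corr_ti = X_pred_ti + K @ r                          # Eq. (8.6): correct the past estimate
    X_corr_now = X_pred_now + Phi @ (K @ r)                # Eq. (8.7): carry the correction to "now"
    return X_corr_ti, X_corr_now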
9. Application of estimation to VS
In a nearly identical way to that which is outlined above, the VS system of Fig. 3.1 could be expanded to apply estimation in order to produce continuous estimates of the camera-space coordinates, {X_c} = [x_c y_c]^T, of the positioned point A. In accordance with the above notation, call these estimates {X̂_c}. The counterpart to the observation at t = t_i, Eq. (8.4), is simply

\{y(t_i)\} = \{X_c(t_i)\} + \{v(t_i)\} \qquad (9.1)
where, as before, [R] = E({v}{v}^T). A subtle notational distinction is now introduced into Eq. (9.1) relative to the initial discussion of the video samples for the system of Fig. 3.1. Whereas {y(t_i)} represents the camera-space sample of the location of A in the image at t = t_i, as determined from the grayscale pixel matrix and the particular image-analysis routine in use, {X_c(t_i)} takes on a new meaning which is somewhat more subtle than before. It is now assumed that, associated with the current angular position {θ(t)}, is a true position of point A, {X_c(t)}, in the 2D camera sensor - a camera-space position which is not limited in resolution by the pixel quantization of the vision system, one which is fully consistent with the particular way in which the vision system maps junctures in the robot's physical workspace into camera space, and one which changes continuously and smoothly as the joint rotations of the robot, {θ}, change. {X̂_c(t)} is always the current best estimate of {X_c(t)}. The big advantage of Eq. (9.1) over Eq. (8.4) of the wheelchair system, in addition to its simplicity, is the fact that no calibration of the cameras is required in an attempt to relate the location of points in physical space to their location in camera space. VS and CSM obviate the need to establish any kind of prior mapping of physical space into camera space, thereby imparting great robustness and precision to the system, and saving much trouble and expense. (The wheelchair navigation system described above does, in contrast, rely upon camera calibration.) Also in parallel with the wheelchair development, estimates of the state during the intervals between junctures of new-sample availability, t_{i-1}^* < t < t_i^*, are found from:

\{\hat{X}_c(t \mid t_{i-1})\} = \{\hat{X}_c(t_{i-1}^* \mid t_{i-1})\} + \int_{t_{i-1}^*}^{t} \{\gamma(\hat{X}_c(\tau), u(\tau))\}\, d\tau \qquad (9.2)
where {u} = [u_1 u_2]^T = [dθ_1/dt dθ_2/dt]^T, i.e. the ongoing, measured angular velocities of the robot's two joints. In the case of the holonomic VS system, however, the functions {γ(X_c, u)} involve the Jacobian introduced earlier. From Eq. (4.4),

\{\gamma(X_c, u(t))\} = [J(\theta(t))]\,\{u(t)\} \qquad (9.3)
Due to the holonomic nature of the system of interest, substitution of Eq. (9.3) into the integrand of Eq. (9.2) produces a perfect differential; this yields, for all t_{i-1}^* < t < t_i^*,

\{\hat{X}_c(t)\} = \{\hat{X}_c(t_{i-1}^* \mid t_{i-1})\} + \{h(\theta(t); D)\} - \{h(\theta(t_{i-1}^*); D)\} \qquad (9.4)
where {h(θ; D)} is defined in Eq. (5.1). The counterpart to Eq. (8.6) of the wheelchair system becomes simply
\{\hat{X}_c(t_i \mid t_i)\} = \{\hat{X}_c(t_i \mid t_{i-1})\} + [K(t_i)]\,\{y(t_i) - \hat{X}_c(t_i \mid t_{i-1})\} \qquad (9.5)
Using Eq. (8.7), we can transition the new-observation-based correction of the best state estimate of Eq. (9.4) at the (now-past, due to image-processing time) instant t_i to the current instant of new-image-information availability, t_i^*:

\{\hat{X}_c(t_i^* \mid t_i)\} = \{\hat{X}_c(t_i^* \mid t_{i-1})\} + [M(t_i^*)]\,\{r(t_i)\} \qquad (9.6)
where [M(t_i^*)] = [Φ(t_i^*, t_i)][K(t_i)] can be considered a 2 × 2 "weighting" matrix for use at the time of new-image availability, t_i^*, and {r(t_i)} is the "residual", or the difference at t_i between the sampled position in the image of point A of Fig. 3.1 and the position expected based on information acquired previous to t_i^*.
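A corresponding sketch for the VS extension, Eqs. (9.4)-(9.6): between image samples the camera-space estimate of point A is carried forward using only the measured joint rotations and the model h, and a delayed sample is folded in through the gains K and M. The affine h used in the short example is invented purely for illustration.

import numpy as np

def propagate_camera_estimate(Xc_hat_prev, h_fun, theta_now, theta_prev):
    """Eq. (9.4): carry the camera-space estimate of point A forward between samples."""
    return Xc_hat_prev + h_fun(theta_now) - h_fun(theta_prev)

def vision_update(Xc_pred_ti, Xc_pred_now, y_ti, K, M):
    """Eqs. (9.5)-(9.6): fold a delayed, noisy image sample into the estimate.
    K is the 2x2 gain at t_i; M = Phi(t_i*, t_i) K is the weighting matrix applied
    at the instant the processed sample becomes available."""
    r = y_ti - Xc_pred_ti                          # residual {r(t_i)}
    return Xc_pred_ti + K @ r, Xc_pred_now + M @ r

# Short example with an invented affine camera-space kinematics h(theta):
h = lambda th: np.array([[120.0, 40.0], [-30.0, 150.0]]) @ th + np.array([320.0, 240.0])
print(propagate_camera_estimate(np.array([350.0, 260.0]), h,
                                np.array([0.32, 0.81]), np.array([0.30, 0.80])))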
10. Equivalence of VS with CSM
The algorithm suggested above becomes equivalent to the CSM algorithm in the special limiting case where the 6-parameter CSM estimation model {h†(θ; C)}, given by
\{h^{\dagger}(\theta; C)\} =
\begin{Bmatrix}
(C_1^2 + C_2^2 - C_3^2 - C_4^2)\, f_x(\theta) + 2(C_2 C_3 + C_1 C_4)\, f_y(\theta) + 2(C_2 C_4 - C_1 C_3)\, f_z(\theta) + C_5 \\
2(C_2 C_3 - C_1 C_4)\, f_x(\theta) + (C_1^2 - C_2^2 + C_3^2 - C_4^2)\, f_y(\theta) + 2(C_3 C_4 + C_1 C_2)\, f_z(\theta) + C_6
\end{Bmatrix} \qquad (10.1)

is modified to the 2-parameter form where the first four constants, C_1 - C_4 of Eq. (10.1), are taken to be already provided, based on some initial calibration, and only C_5 and C_6 remain to be estimated over the course of the current maneuver. It is natural that the coming together of CSM with VS would require prior establishment of C_1 - C_4: these are the parameters which relate to the orientation of the camera relative to the robot's base as well as the scaling of the mapping from physical space into camera space. As such, this information would be contained in the Jacobian [J] of the VS algorithm; recall that VS does require prior establishment of [J]. To segment the vector {C} for the modified CSM algorithm, let {C^I} = [C_1 C_2 C_3 C_4]^T and the estimated {C^{II}} = [C_5 C_6]^T. The model is rewritten as {h†(θ; C^I, C^{II})}, where it is understood that only the elements of {C^{II}} are used for estimation. For the case where the last available video sample occurs at t = t_i, Eq. (5.3) becomes:
F^{\dagger}(C^I, C^{II}) = \sum_{j=1}^{i} \{h^{\dagger}(\theta(t_j); C^I, C^{II}) - y^{\dagger}(t_j)\}^T \, [W_j] \, \{h^{\dagger}(\theta(t_j); C^I, C^{II}) - y^{\dagger}(t_j)\} \qquad (10.2)
where the modified jth camera-space sample of Eq. (5.3), {x_c†(t_j)}, is re-expressed as {y†(t_j)} in order to be consistent with the discussion immediately above. The necessary condition for minimizing F†, {∂F†/∂C^{II}} = {0}, results in two equations for the two elements of {C^{II}}. Following the notation introduced above of letting t_i be the instant at which the most recent video sample {y†(t_i)} is acquired, the necessary conditions for finding {C^{II} | t_i} can, due to the linear appearance of {C^{II}} in Eq. (10.1), be written in terms of the previous estimate {C^{II} | t_{i-1}}:

\{\Delta C^{II}\} = \Big[\sum_{j=1}^{i} [W_j]\Big]^{-1} [W_i]\, \{y^{\dagger}(t_i) - h^{\dagger}(\theta(t_i); C^I, C^{II} \mid t_{i-1})\} \qquad (10.3)

where {ΔC^{II}} = {C^{II} | t_i} - {C^{II} | t_{i-1}}. From this we see that the updated best estimate of {X_c(t_i^*)} at the time t_i^*, called here, as above, {X̂_c(t_i^* | t_i)}, can be written in terms of the same estimate based upon information acquired only through t_{i-1}, {X̂_c(t_i^* | t_{i-1})}:
\{\hat{X}_c(t_i^* \mid t_i)\} = \{\hat{X}_c(t_i^* \mid t_{i-1})\} + \Big[\sum_{j=1}^{i} [W_j]\Big]^{-1} [W_i]\, \{y^{\dagger}(t_i) - \hat{X}_c(t_i \mid t_{i-1})\}
= \{\hat{X}_c(t_i^* \mid t_{i-1})\} + [M(t_i^*)]\, \{r(t_i)\} \qquad (10.4)
Of course for all times t_i^* < t < t_{i+1}^*, the CSM estimates of {X̂_c(t | t_i)} are found according to:

\{\hat{X}_c(t \mid t_i)\} = \{h^{\dagger}(\theta(t); C^I, C^{II} \mid t_i)\} \qquad (10.5)
(Where the elements of {C^I} are given as permanent values at the outset, there is no longer a need to supplement current-maneuver samples with samples from a preplanned trajectory.)
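Because C_5 and C_6 enter {h†} additively, the two-parameter update of Eqs. (10.3)-(10.4) reduces to an incremental weighted least-squares shift of the offsets. The sketch below implements that recursion under the reconstruction given above; the running sum of the earlier 2 × 2 weights and the weight of the newest sample are simply handed in.

import numpy as np

def update_offsets(C_II_prev, W_hist_sum, W_new, y_new, h_pred_new):
    """Two-parameter CSM update in the spirit of Eqs. (10.3)-(10.4): with C_1-C_4 frozen,
    the offsets C_II = [C_5, C_6] enter h linearly, so the weighted least-squares estimate
    shifts by a gain [M] times the newest residual.
    W_hist_sum -- running sum of the 2x2 weights [W_j] of all earlier samples
    W_new      -- weight [W_i] of the newest sample
    y_new      -- newest ("flattened") camera-space sample
    h_pred_new -- model prediction of that sample using the previous C_II."""
    M = np.linalg.inv(W_hist_sum + W_new) @ W_new     # the 2x2 weighting matrix [M]
    dC_II = M @ (y_new - h_pred_new)                  # Eq. (10.3)
    return C_II_prev + dC_II, W_hist_sum + W_new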
11. Comparison of the modified CSM with the VS extension based on estimation
60
MichaelSeelinger, Steven B. Skaar, and Matthew Robinson
of the relatively high [R], as mentioned above, which is associated with noise of each new video sample, and, locally, the relatively small expected model error, as represented by a small [Q]. The counterpart to this statement in the context of CSM is that the weighting matrices [Wi] of Eq. (5.2) should diminish only slowly from the highest (most recent) 'T' to "i - 1", and so on. It is this ability to combine, with nearly equal weight, information acquired over several recent samples which accounts for the reliable sub pixel precision of CSM. While it is possible to select the current-sample weighting matrix, [Wi], of Eq. (5.2) such that [M(t*)] becomes the same in both Eqs. (9.6) and (10.4), it is neither the optimal nor usual choice [45, 46] for the best performance of the CSM system. This is due to the fact that the KF-based matrix, [M(t~)], in Eq. (9.6) would grow and increase with time alone. This time-driven weighting scheme presents a problem in that the nature of the system is such that error in the estimates of {C} do not grow with time, per se, but rather with the forward evolution of {~_} itself. Thus, it is better to "deweight" earlier visual samples in proportion (or in relation) to forward movement of the jointrotation vector rather than making the weighting scheme simply a function of time. Return of CSM to its conventional form involving all six elements of the vector {___C},likewise, results in considerable improvement in robustness and precision. This improvement is due to the fact that, over the course of a typical maneuver, C1 - C 4 in fact change enough, due to local operating conditions, that failure to update them in response to local (in joint-space and camera-space) samples significantly limits the precision that is available. For example, consider the system of Fig. 2.1(c). As the position and orientation of the tip shifts from one etching pose to another, the orientation of any camera relative to manipulator base, as reflected in local estimates of C1 6'4 changes, typically, by one degree. Of course the physical orientation of these two bodies does not change; but local C1 - C4 estimates "absorb" or accommodate imperfections in the manipulator-kinematics model and/or the vision-system model. This extent of change, if not adjusted for, causes over a millimeter of error in tool-tip position which is triple the typical error. As VS returns from the estimation-based form suggested above to any of its forms given in the literature, in contrast, it is likely to degrade in effectiveness due to the problems discussed herein of coarse real-time imageerror samples, intermittence of the visual input, and loss of the error "signal" near closure.
12. E x t e n s i o n
to three dimensions
The extension to three dimensions is fundamentally distinct between the two methods. CSM makes use of two or more widely separated (in angle) cameras [12] each of which produces estimates of its own "camera-space"
An Alternative.Approach for Image-Plane Control of Robots 9
61
d
0
B arget
C
Fig. 12.1. Multiple camera/3-D system.
kinematics, thereby mapping n-dimensional joint space into 2-dimensional camera-space, according to the same procedure outlined above for the 2D case. With such best estimates in place, and with camera-space targets already in hand for each participating camera, there are generally more cameraspace objectives than robot degrees of freedom (or, indeed, degrees of freedom intrinsically required by the maneuver). Therefore, Eq. (5.4) for the determination of the terminal values {0_r(t/) ) needs to be altered to reflect multiple cameras and the redundant maneuver objectives. This is accomplished, for example in the case of three-dimensional point positioning, with the three-axis robot of Fig. 12.1, by estimating a separate set of camera-specific parameters to produce a separate set of estimates for each participating camera - for the jth camera denoted here by {C_j}. The jth camera would have its own camera-space target, {XJB}, and {0r(t/) } is calculated by minimizing H(0__) given by [46] H(~_) = Z {h' (8;C j ) - X ~ J
{ h f (0__;C__j)-X~}
(12.1)
according to
The more widely separated and larger in number the participating cameras, the more physically accurate the positioning [12]; also, with CSM, use of longfocal-length cameras placed at a distance from the workpiece tends to benefit terminal positioning accuracy [46]. VS solutions, in contrast, rely upon either a single stereo pair or a single camera where depth information is dependent
upon camera perspective effects. Moreover, the use of Eqs. (12.1) and (12.2), and hence the use of multiple independent cameras to solve for the internal joint rotations required for closure, manifestly depends upon an algebraic description, rather than a differential-only description, as per VS, of the camera-space kinematics. The extension to higher degrees of manipulator freedom and rigid-body positioning (i.e., orientation control as well as position control) of CSM and VS likewise is significantly distinct, with the approach used in the latter tending to be highly dependent upon the geometry of the components involved. For a discussion of the general approach used in CSM see [24].
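To make Eqs. (12.1)-(12.2) concrete, the following sketch (not from the chapter) solves for the terminal joint rotations by stacking the camera-space residuals of every participating camera into a nonlinear least-squares problem. The two-link planar kinematics, the affine camera-space models, and all names are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares

def fwd_kin(theta):
    """Tip position of a hypothetical 2-link planar arm (unit link lengths)."""
    x = np.cos(theta[0]) + np.cos(theta[0] + theta[1])
    y = np.sin(theta[0]) + np.sin(theta[0] + theta[1])
    return np.array([x, y, 0.0])

def h(theta, C):
    """Camera-space prediction for one camera, here a simple 2x4 affine model C."""
    return C @ np.append(fwd_kin(theta), 1.0)

def residuals(theta, cams, targets):
    """Stacked residuals h(theta; C_j) - X_B^j over all participating cameras."""
    return np.concatenate([h(theta, C) - X for C, X in zip(cams, targets)])

# Two hypothetical cameras and their camera-space targets.
cams = [np.array([[120.0, 0, 10, 320], [0, -120.0, 5, 240]]),
        np.array([[80.0, 60, 0, 300], [-20, -100.0, 0, 260]])]
theta_goal = np.array([0.4, 0.7])
targets = [h(theta_goal, C) for C in cams]

sol = least_squares(residuals, x0=np.zeros(2), args=(cams, targets))
print("terminal joint rotations:", sol.x, "residual cost:", sol.cost)  # cost ~ 0 at closure
```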
13. Summary and Conclusions

A comparison is drawn between CSM and VS. For the class of problems involving a stationary workpiece, stationary cameras, and a stationary, holonomic robot, CSM is shown to be preferable due to its reliance upon estimation, joint-trajectory planning and attendant control over the time of closure, and, for 3D applications, multiple, independent, uncalibrated cameras.
References
1. P. K. Allen, A. Timcenko, B. Yoshimi, and P. Michelman. Real-time visual servoing. In Proc. IEEE International Conference on Robotics and Automation, pages 1850-1856, 1992.
2. P. K. Allen, A. Timcenko, B. Yoshimi, and P. Michelman. Hand-eye coordination for robotic tracking and grasping. In K. Hashimoto, editor, Visual Servoing, pages 33-69. World Scientific, 1993.
3. E. T. Baumgartner. An autonomous vision-based mobile robot. PhD thesis, University of Notre Dame, Notre Dame, IN, 1992.
4. E. T. Baumgartner, M. J. Seelinger, M. Fessler, A. Aldekamp, E. Gonzalez-Galvan, J. D. Yoder, and S. B. Skaar. Accurate 3-d robotic point positioning using camera-space manipulation. In T. J. Rudolph and L. W. Zachary, editors, Twenty-Fourth Midwestern Mechanics Conference, 1995.
5. E. T. Baumgartner and S. B. Skaar. An autonomous, vision-based mobile robot. IEEE Transactions on Automatic Control, 39(3):493-502, 1994.
6. D. J. Bennet, D. Geiger, and J. M. Hollerbach. Autonomous robot calibration for hand-eye coordination. International Journal of Robotics Research, 10(5):550-559, 1991.
7. R. Bernhardt and S. L. Albright. Robot Calibration. Chapman and Hall, London, 1993.
8. Z. Bien, W. Jang, and J. Park. Characterization and use of feature-jacobian matrix for visual servoing. In K. Hashimoto, editor, Visual Servoing, pages 317-363. World Scientific, 1993.
9. M. Brady, J. Hollerbach, T. Johnson, T. Lozano-Perez, and M. Mason. Robot Motion: Planning and Control. MIT Press, Cambridge, 1982.
10. A. Castano and S. A. Hutchinson. Visual compliance: task-directed visual servo control. IEEE Transactions on Robotics and Automation, 10(3):334-342, 1994.
11. F. Chaumette, P. Rives, and B. Espiau. Classification and realization of the different vision based tasks. In K. Hashimoto, editor, Visual Servoing, pages 199-228. World Scientific, 1993.
12. W. Z. Chen, U. Korde, and S. B. Skaar. Position-control experiments using vision. International Journal of Robotics Research, 13(3):199-204, 1994.
13. P. I. Corke. Video-rate robot visual servoing. In K. Hashimoto, editor, Visual Servoing, pages 257-284. World Scientific, 1993.
14. P. I. Corke. Visual control of robot manipulators - a review. In K. Hashimoto, editor, Visual Servoing, pages 1-32. World Scientific, 1993.
15. P. I. Corke and R. P. Paul. Video-rate visual servoing for robots. In V. Hayward and O. Khatib, editors, Experimental Robotics.
16. L. J. Everett, M. Driels, and B. W. Mooring. Kinematic modelling for robot calibration. In Proc. IEEE International Conference on Robotics and Automation, pages 183-189, 1987.
17. J. T. Feddema, C. S. G. Lee, and O. R. Mitchell. Feature-based visual servoing of robotic systems. In K. Hashimoto, editor, Visual Servoing, pages 105-138. World Scientific, 1993.
18. J. T. Feddema and O. R. Mitchell. Vision-guided visual servoing with feature-based trajectory generation. IEEE Transactions on Robotics and Automation, 5(5):691-700, 1989.
19. A. Gelb. Applied Optimal Estimation. MIT Press, Cambridge, 1974.
20. V. Genenbach, H.-H. Nagel, M. Tonka, and K. Schafer. Automatic dismantling integrating optical flow into a machine vision-controlled robot system. In Proc. IEEE International Conference on Robotics and Automation, pages 1320-1325, 1996.
21. E. Gonzalez-Galvan and S. B. Skaar. Servoable cameras for three-dimensional positioning with camera-space manipulation. In Proc. IASTED Robotics and Manufacturing, pages 260-265, 1995.
22. E. J. Gonzalez-Galvan, M. Seelinger, J. D. Yoder, E. Baumgartner, and S. B. Skaar. Control of construction robots using camera-space manipulation. In L. A. Demsetz, editor, Robotics for Challenging Environments, pages 57-63, 1996.
23. E. J. Gonzalez-Galvan and S. B. Skaar. Efficient camera-space manipulation using moments. In Proc. IEEE International Conference on Robotics and Automation, pages 3407-3412, 1996.
24. E. J. Gonzalez-Galvan, S. B. Skaar, U. A. Korde, and W. Z. Chen. Application of a precision enhancing measure in 3-d rigid-body positioning using camera-space manipulation. International Journal of Robotics Research, 16(2):240-257, 1997.
25. K. Hashimoto. Visual Servoing. World Scientific, Singapore, 1993.
26. B. Horn. Robot Vision. MIT Press, Cambridge, 1986.
27. S. Hutchinson, G. Hager, and P. Corke. A tutorial on visual servo control. IEEE Transactions on Robotics and Automation, 12(5):651-670, 1996.
28. W. Jang, K. Kim, M. Chung, and Z. Bien. Concepts of augmented image space and transformed feature space for efficient visual servoing of an 'eye-in-hand robot'. Robotica, 9:203-212, 1991.
29. U. A. Korde, E. Gonzalez-Galvan, and S. B. Skaar. Three-dimensional camera-space manipulation using servoable cameras. In Proc. SPIE Intelligent Robots and Computer Vision, pages 658-667, 1992.
30. F. L. Lewis. Optimal Estimation With An Introduction To Stochastic Control Theory. John Wiley & Sons, New York, 1986.
31. S. Maybank and O. D. Faugeras. A theory of self-calibration of a moving camera. International Journal of Computer Vision, 8(2):123-151, 1990.
32. R. K. Miller, D. G. Stewart, H. Brockman, and S. B. Skaar. A camera space control system for an automated forklift. IEEE Transactions on Robotics and Automation, 10(5):710-716, 1994.
33. B. W. Mooring, Z. S. Roth, and M. Driels. Fundamentals of Manipulator Calibration. John Wiley and Sons, New York, 1991.
34. B. Nelson, N. P. Papanikolopoulos, and P. K. Khosla. Visual servoing for robotic assembly. In K. Hashimoto, editor, Visual Servoing, pages 139-164. World Scientific, 1993.
35. N. Papanikolopoulos, P. K. Khosla, and T. Kanade. Vision and control techniques for robotic visual tracking. In Proc. IEEE International Conference on Robotics and Automation, pages 857-864, 1991.
36. G. V. Puskorius and L. A. Feldkamp. Global calibration of a robot/vision system. In Proc. IEEE International Conference on Robotics and Automation, pages 190-195, 1987.
37. P. Rives, F. Chaumette, and B. Espiau. Positioning of a robot with respect to an object, tracking it and estimating its velocity by visual servoing. In V. Hayward and O. Khatib, editors, Experimental Robotics, pages 412-428. Springer Verlag, 1989.
38. A. A. Rizzi and D. E. Koditschek. An active visual estimator for dextrous manipulation. IEEE Transactions on Robotics and Automation, 12(5):697-713, 1996.
39. A. C. Sanderson and L. E. Weiss. Image-based visual servo control using relational graph error signals. In Proc. IEEE, pages 1074-1077, 1980.
40. P. S. Schenker et al. Mars lander robotics and machine vision capabilities for in situ planetary science. In Proc. IEEE Intelligent Robotics and Computer Vision XIV, volume 2588, pages 351-353, 1995.
41. M. J. Seelinger, E. Gonzalez-Galvan, S. B. Skaar, and M. Robinson. Point-and-click objective specification for a remote semiautonomous robot system. In P. S. Schenker and G. T. McKee, editors, Proc. SPIE Sensor Fusion and Distributed Robotic Agents, volume 2905, pages 206-217, 1996.
42. M. J. Seelinger, M. Robinson, Z. Dieck, and S. B. Skaar. A vision-guided, semiautonomous system applied to a robotic coating application. In P. S. Schenker and G. T. McKee, editors, Proc. SPIE Sensor Fusion and Decentralized Control in Autonomous Robotic Systems, volume 3209, pages 133-144, 1997.
43. R. Sharma and S. Hutchinson. Motion perceptibility and its application to active vision-based servo control. IEEE Transactions on Robotics and Automation, 13(4):607-617, 1997.
44. S. B. Skaar. An adaptive vision-based manipulator control scheme. In Proc. AIAA Guidance, Navigation and Control Conference, pages 608-614, 1986.
45. S. B. Skaar, W. H. Brockman, and R. Hanson. Camera space manipulation. International Journal of Robotics Research, 6(4):20-32, Winter 1987.
46. S. B. Skaar, W. H. Brockman, and W. S. Jang. Three dimensional camera space manipulation. International Journal of Robotics Research, 9(4):22-39, 1990.
47. S. B. Skaar, W. Z. Chen, and R. K. Miller. High resolution camera space manipulation. In Proc. ASME Design Automation Conference, pages 608-614, 1991.
48. S. B. Skaar and E. Gonzalez-Galvan. Versatile and precise manipulation using vision. In S. B. Skaar and C. F. Ruoff, editors, Teleoperation and Robotics in Space, pages 241-279. AIAA, Washington, D.C., 1994.
49. S. B. Skaar, I. Yalda-Mooshabad, and W. H. Brockman. Nonholonomic camera-space manipulation. IEEE Transactions on Robotics and Automation, 8(4):464-479, 1992.
50. K. Tani, M. Abe, and T. Ohno. High precision manipulator with visual sense. In Proc. ISIR, pages 561-568, 1977.
51. K. Toyama, G. Hager, and J. Wang. Servomatic: A modular system for robust positioning using stereo visual servoing. In Proc. IEEE International Conference on Robotics and Automation, pages 2636-2642, 1996.
52. H. Trivedi. A semi-analytic method for estimating stereo camera geometry from matched points. Image and Vision Computing, 9:227-236, 1991.
53. R. Y. Tsai. A versatile camera calibration technique for high accuracy 3d machine vision metrology using off-the-shelf tv cameras and lenses. IEEE Transactions on Robotics and Automation, 3(4):323-344, 1987.
54. J. Weng, P. Cohen, and M. Herniou. Camera calibration with distortion models and accuracy evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(10):965-980, 1992.
55. W. J. Wilson. Visual servo control of robots using kalman filter estimates of robot pose relative to work-pieces. In K. Hashimoto, editor, Visual Servoing, pages 71-104. World Scientific, 1993.
56. P. Wunsch and G. Hirzinger. Real-time visual tracking of 3-d objects with dynamic handling of occlusion. In Proc. IEEE International Conference on Robotics and Automation, pages 2868-2873, 1997.
57. D. B. Zhang, L. V. Gool, and A. Oosterlinck. Stochastic predictive control of robot tracking systems with dynamic visual feedback. In Proc. IEEE International Conference on Robotics and Automation, pages 610-615, 1990.
58. H. Zhuang and Z. S. Roth. Camera-Aided Robot Calibration. CRC Press, Boca Raton, 1996.
Potential Problems of Stability and Convergence in Image-Based and Position-Based Visual Servoing

François Chaumette

IRISA / INRIA Rennes, Campus de Beaulieu, 35042 Rennes cedex, France

Summary. Visual servoing, using image-based control or position-based control, generally gives satisfactory results. However, in some cases, convergence and stability problems may occur. The aim of this paper is to emphasize these problems by considering an eye-in-hand system and a positioning task with respect to a static target which constrains the six camera degrees of freedom.
1. Introduction

The two classical approaches to visual servoing (that is, image-based control and position-based control) differ in the nature of the inputs used in their respective control schemes [28, 10, 7]. Even if the resulting robot behaviors thus also differ, both approaches generally give satisfactory results: convergence to the desired position is reached, and, thanks to the closed loop used in the control scheme, the system is stable and robust with respect to camera calibration errors, robot calibration errors, and image measurement errors. That is particularly true when only a few degrees of freedom are vision-based controlled (such as camera pan and tilt for target tracking). However, in some cases, convergence and stability problems may occur. The aim of this paper is to emphasize these problems with simple and concrete examples. We consider in this paper an eye-in-hand system, and a generic positioning task with respect to a static target which constrains the six camera degrees of freedom. However, most of the given results can be generalized to other kinds of systems, such as, for example, monocular or binocular vision systems simultaneously observing a target and a robot end-effector [1, 9].
2. Image-based visual servoing
Image-based visual servoing is based on the selection in the image of a set s of visual features that has to reach a desired value s*. Usually, s is composed of the image coordinates of several points belonging to the considered target. However, it may be interesting to use other geometrical visual features (such as the parameters which represent the image of a straight line, or of a sphere,
etc.) as a function of the vision-based task that has to be achieved and the nature of the objects present [2, 3]. As for s*, it is obtained either during an off-line learning step (where the robot is moved to its desired position with respect to the target and the corresponding image is acquired), or by computing the projection in the image of a 3D model of the target for the desired camera pose (which necessitates a perfect camera calibration and a perfect knowledge of the 3D target model in order that the camera does reach the given pose). It is well known that the image Jacobian Js, also called the interaction matrix related to s, plays a crucial role in the design of the possible control laws. Js is defined from:

$$ \dot{s} = \frac{\partial s}{\partial r}\,\frac{dr}{dt} = J_s\, T \qquad (2.1) $$

where T = dr/dt = (V^T, Ω^T)^T is the camera velocity screw (V and Ω represent its translational and rotational components respectively). Using a classical perspective projection model with unit focal length, and if the X and Y coordinates of image points are selected in s, Js is directly obtained from:

$$ J_s = \begin{pmatrix} -1/z & 0 & X/z & XY & -(1+X^2) & Y \\ 0 & -1/z & Y/z & 1+Y^2 & -XY & -X \end{pmatrix} \qquad (2.2) $$

where z is the depth of the corresponding point in the camera frame. All the existing control schemes compute the camera velocity sent to the robot controller (or directly the robot joint velocities, by introducing the robot Jacobian, if joint-limit and kinematic-singularity avoidance is needed [21, 17]) with the following forms ([8, 11, 23], etc.):

$$ T = \widehat{J}^{+}\dot{s} \qquad \text{or} \qquad T = f\!\left(\widehat{J}^{+}(s - s^*)\right) \qquad (2.3) $$

where the function f may be as simple as a proportional gain [2], or a more complex function used to regulate s to s* (optimal control [23], non-linear control [11], etc.), and J+ is a model, an approximation, or an estimation of the pseudo-inverse of Js. Indeed, camera calibration errors, noisy image measurements, and the unknown depth z involved in (2.2) imply the use of such a model, since the real value of Js remains unknown. A well-known sufficient condition to ensure the global asymptotic stability of the system (which is of course non-linear since X and Y are involved in (2.2)) is [13]:

$$ \widehat{J}^{+} J_s(s(t), z(t)) > 0, \quad \forall t \qquad (2.4) $$
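For concreteness, the following sketch (not part of the paper) evaluates the interaction matrix (2.2) for a list of image points and computes one instance of the control laws in (2.3), T = -λ J+(s - s*); the points, depths, and gain are illustrative, and the depth supplied to the model is a free choice.

```python
import numpy as np

def interaction_matrix(points):
    """Stack the 2x6 block of Eq. (2.2) for each point (X, Y, z)."""
    rows = []
    for X, Y, z in points:
        rows.append([-1/z, 0, X/z, X*Y, -(1 + X**2), Y])
        rows.append([0, -1/z, Y/z, 1 + Y**2, -X*Y, -X])
    return np.array(rows)

def ibvs_velocity(points, s_star, lam=0.5, z_hat=None):
    """One instance of (2.3): T = -lam * J_hat^+ (s - s*).
    z_hat selects the depth used in the model (true, estimated, or desired)."""
    model_pts = [(X, Y, z_hat if z_hat is not None else z) for X, Y, z in points]
    J_hat = interaction_matrix(model_pts)
    s = np.array([c for X, Y, _ in points for c in (X, Y)])
    return -lam * np.linalg.pinv(J_hat) @ (s - s_star)

# Usage: four current points, a desired feature vector s*, and the resulting screw T.
points = [(0.1, 0.1, 1.0), (-0.1, 0.1, 1.2), (-0.1, -0.1, 1.1), (0.1, -0.1, 0.9)]
s_star = np.array([0.2, 0.2, -0.2, 0.2, -0.2, -0.2, 0.2, -0.2])
print(ibvs_velocity(points, s_star))
```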
This condition, even if it is difficult to exploit in practice, allows one to assess the possible choices for J+. In fact, three different cases have been considered in the literature:
- J+ = J+(t). In that case, the image Jacobian is numerically estimated during the camera motion without taking into account the analytical form
given by (2.2) [13, 15] (neural networks have also sometimes been used [27]). This approach seems to be very interesting if no camera or robot models are available. However, it is impossible in that case to demonstrate when condition (2.4) is ensured. Furthermore, an initially coarse estimation of the image Jacobian may lead to unstable results, especially at the beginning of the servoing, and some visual features may get out of the camera field of view (see Figure 2.1.a).
- J+ = J+(s(t), ẑ(t)). The image Jacobian is now updated at each iteration of the control law using in (2.2) the current measure of the visual features and an estimation ẑ(t) of the depth of each considered point. ẑ(t) can be obtained either from the knowledge of a 3D model of the object [4], or from the measure of the camera motion [26, 2]. This case seems to be optimal since, ideally, we then have J+Js = I, ∀t, which of course satisfies condition (2.4) and implies a perfectly decoupled system. Each image point is constrained to reach its desired position following a straight line (see Figure 2.1.b). However, we will see that such a control in the image may imply inadequate camera motion, leading to possible local minima and the nearing of task singularities.
- J+ = J+(s*, z*). In this last case, J+ is constant and determined during an off-line step using the desired value of the visual features and an approximation of the points' depth at the desired camera pose. Condition (2.4) is now ensured only in a neighborhood of the desired position, and a decoupled behavior will be achieved only in a smaller neighborhood. Determining analytically the limits of these neighborhoods seems to be out of reach owing to the complexity of the involved symbolic computations. The performed trajectory in the image may be quite unpredictable, and some visual features may get out of the camera field of view during the servoing (leading of course to its failure), especially if the initial camera position is far away from its desired one (see Figure 2.1.c). A numerical comparison of these choices is sketched below, after Fig. 2.1.
Fig. 2.1. Possible choices for J+ and corresponding behavior: a) J+(t), b) J+(s(t), ẑ(t)), c) J+(s*, z*). Black points represent the initial position of the target in the image; gray points and lines respectively represent its desired position and a possible trajectory in the image.
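The sketch below (illustrative, not from the paper) probes condition (2.4) for two of these choices at a configuration away from the goal, by checking the eigenvalues of the symmetric part of the product J+Js; the point coordinates and depths are arbitrary.

```python
import numpy as np

def interaction(points):
    """Stacked interaction matrix of Eq. (2.2) for a list of (X, Y, z) points."""
    rows = []
    for X, Y, z in points:
        rows.append([-1/z, 0, X/z, X*Y, -(1 + X**2), Y])
        rows.append([0, -1/z, Y/z, 1 + Y**2, -X*Y, -X])
    return np.array(rows)

current = [(0.3, 0.2, 0.8), (-0.2, 0.3, 1.1), (-0.3, -0.2, 1.3), (0.2, -0.3, 0.9)]
desired = [(0.1, 0.1, 1.0), (-0.1, 0.1, 1.0), (-0.1, -0.1, 1.0), (0.1, -0.1, 1.0)]

J_true = interaction(current)
for name, model in [("J+(s(t), z(t))", current), ("J+(s*, z*)", desired)]:
    M = np.linalg.pinv(interaction(model)) @ J_true
    # Condition (2.4) asks this 6x6 product to be positive as a quadratic form;
    # we check the smallest eigenvalue of its symmetric part.
    print(name, np.linalg.eigvalsh(0.5 * (M + M.T)).min())
```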
Image-based visual servoing is known to be generally satisfactory, even in the presence of important camera or hand-eye calibration errors [5]. However, we now exhibit the following stability and convergence problems which may occur:
- Js, or even J+, may become singular during the servoing, which of course leads to unstable behavior.
- local minima may be reached owing to the existence of unrealizable image motions.
2.1 Reaching or nearing a task singularity

It is well known that the image Jacobian is singular if s is composed of the images of three points that are collinear, or that belong to a cylinder containing the camera optical center [3, 19, 22]. Using more than three points generally allows us to avoid such singularities. However, we now demonstrate by a concrete example that, whatever the number of points and their configuration, the image Jacobian may become singular during the visual servoing if image points are chosen as visual features. Let us consider that the camera motion from its initial to its desired pose is a pure rotation of 180° around the optical axis. This 3D motion leads to an image motion corresponding to a symmetry around the principal point. If J+(s(t), ẑ(t)) is used in the control scheme and perfect measurements and estimations are assumed, we can note that J+Js = I for the initial camera position, which leads us to expect a correct behavior. However, the obtained image trajectory of each point is a straight line such that all the points lie at the principal point at the same instant (see Figure 2.2.a). This corresponds to a pure backward translational camera motion along the optical axis (and, unfortunately, to a zero rotational motion around the optical axis) that moves the camera to infinity. At this unexpected position, the image Jacobian of each point i is given by (see (2.2)):

$$ J_{s_i} = \begin{pmatrix} 0 & 0 & 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{pmatrix} \qquad (2.5) $$

The interaction matrix Js, and J+, are here of rank 2 instead of 6, which of course corresponds to a task singularity, and condition (2.4) is no longer ensured. Let us now consider the case where J+(s*, z*) is used in the control scheme. This choice implies that the control law behaves as if the error in the image were as small as possible. It is clear from Figure 2.2.b (where white points correspond to such a near position) that the obtained camera motion is now a pure forward translational motion along the optical axis (and, once again, without any rotational motion around the optical axis). The camera thus moves directly toward the target (in practice, the collision is avoided thanks to the visual features getting out of the image), and toward
a singularity of Js. Indeed, when z = 0, for all points not lying on the optical center, Jsi is given by:

$$ J_{s_i} \approx \begin{pmatrix} \infty & 0 & \infty & \cdot & \cdot & \cdot \\ 0 & \infty & \infty & \cdot & \cdot & \cdot \end{pmatrix} \qquad (2.6) $$

that is, the translational columns of (2.2), which scale with 1/z, diverge while the rotational columns remain finite. It is interesting to note that, in that case, J+(s*, z*), which is used in the control scheme, is not singular. However, the problem occurs owing to the singularity of Js, which is involved in condition (2.4). In the two previous cases, reaching the singularity can be avoided if the camera rotation is smaller. However, the coupling between translational and rotational camera motion implies a really unsatisfactory camera trajectory, with the nearing of (and then the moving away from) the singularity. Furthermore, even for the considered initial position, the singularity can be avoided, and a perfect camera trajectory can be achieved (that is, a pure rotational camera motion around the optical axis) if straight lines are used in s instead of points (see Figure 2.2.c). Indeed, by considering the (ρ, θ) parameters which describe the position of a straight line in the image [2], we obtain the results depicted in Figure 2.3. In the presented simulation results, we have used four lines from the six that can be defined from four points. As can be seen on the plots, the four errors ρi - ρi* always remain at zero, while the four θi - θi* simultaneously converge from -180° to 0°, thanks to the computation of a pure camera rotational motion around the optical axis. Let us note that similar experimental results can be obtained (calibration errors and noisy image measurements of course introduce small perturbations), and that these results depend neither on the number nor on the configuration of the considered features. Let us finally claim that we have here demonstrated that using straight lines is sometimes more interesting than using points, but nothing more. Indeed, some singularities may perhaps appear with straight lines for particular configurations, and some camera motions may perhaps be more adequately achieved using points (or anything else) than straight lines.
Fig. 2.2. Reaching (or not) a singularity: a) image motion using J+(s(t), ẑ(t)), b) image motion using J+(s*, z*), c) s defined using 2D straight lines.
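The rank collapse of Eq. (2.5) is easy to reproduce numerically (a sketch, not from the paper; the generic point configuration is arbitrary): when every image point is driven to the principal point at (near-)infinite depth, each 2x6 block reduces to (2.5) and the stacked interaction matrix drops to rank 2.

```python
import numpy as np

def point_block(X, Y, z):
    """2x6 interaction-matrix block of Eq. (2.2) for one point."""
    return np.array([[-1/z, 0, X/z, X*Y, -(1 + X**2), Y],
                     [0, -1/z, Y/z, 1 + Y**2, -X*Y, -X]])

generic = np.vstack([point_block(X, Y, 1.0)
                     for X, Y in [(0.2, 0.1), (-0.3, 0.2), (0.1, -0.2), (-0.1, -0.3)]])
degenerate = np.vstack([point_block(0.0, 0.0, 1e12) for _ in range(4)])

print(np.linalg.matrix_rank(generic))               # 6: generic configuration
print(np.linalg.matrix_rank(degenerate, tol=1e-6))  # 2: the singularity of Eq. (2.5)
```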
Fig. 2.3. Perfect behavior using 2D straight lines in s (plots of the camera velocity screw T and of the image error s - s*).
2.2 Reaching local minima

We now focus on another potential problem that may appear in practice. By definition, local minima are such that T = 0 and s ≠ s* (or ṡ ≠ 0). This is equivalent to:

$$ s - s^* \in \operatorname{Ker}\widehat{J}^{+} \quad (\text{or } \dot{s} \in \operatorname{Ker}\widehat{J}^{+}) \qquad (2.7) $$

If s is composed of three image points (such that Js is of full rank 6), Ker J+ = 0, which implies that there are no local minima. However, it is well known that the same image of three points can be seen from four different camera poses [12]. In other words, there exist four camera poses (that is, four global minima) such that s = s* (or ṡ = 0). A unique pose can theoretically be obtained by using at least four points. However, in that case, Js is of dimension 8 × 6, which implies that dim Ker J+ = 2. This does not demonstrate that local minima always exist. Indeed, the configurations s such that s - s* (or ṡ) ∈ Ker J+ must be physically coherent (which means that a corresponding camera pose exists). The complexity of the involved symbolic computations seems to make the determination of general results impossible. Particular cases can, however, be exhibited. Figure 2.4 presents the simulation results for a planar target composed of four points, obtained using Js(s(t), ẑ(t)) in the control scheme. As can be seen, the visual features simultaneously decrease owing to the strategy used. However, a local minimum is reached since the camera velocity is zero while the final camera position is far away from its desired one. At that position, the error s - s* in the image does not completely vanish (the residual error is approximately one pixel on each X and Y coordinate). Introducing noise in the image measurements leads to the same results, which can also be obtained in real experiments. It is interesting to note that the global minimum is correctly reached from the same initial camera position if Js(s*, z*) is used in the control scheme (see Figure 2.5). In that case, as can be seen on s - s*, the trajectory in
Fig. 2.4. Reaching a local minimum using Ĵ = Js(s(t), ẑ(t)).
Fig. 2.5. Reaching the global minimum using Ĵ = Js(s*, z*).
the image is quite surprising, as is the computed control law, but this behavior allows the system to avoid the local minimum. In fact, potential local minima are due to the existence of unrealizable image motions. Such motions ṡ⊥ are defined by ṡ⊥ ∈ Ker Js^T. Indeed, they are such that they do not belong to the range space of Js. In other words:

$$ \dot{s}_\perp \notin \operatorname{Im} J_s \;\Rightarrow\; \dot{s}_\perp \in (\operatorname{Im} J_s)^{\perp} = \operatorname{Ker} J_s^{T} \qquad (2.8) $$

Once again, if s is composed of four points, dim Ker Js^T = 2. An illustration, easily obtained from (2.2), is presented in Figure 2.6 in the case of four coplanar points parallel to the image plane and forming a square.
Fig. 2.6. Example of unrealizable image motion: arrows in gray and black respectively represent one of the two spanning vectors of Ker Js^T. Points in gray represent an example of image configurations unreachable from points in black using Js(s(t), ẑ(t)) in the control scheme. Reciprocally, points in black represent an example of image configurations unreachable from points in gray using Js(s*, z*).
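The two unrealizable directions of Figure 2.6 can be exhibited numerically (a sketch, not from the paper; the square and its depth are arbitrary) by computing the left null space of the stacked interaction matrix:

```python
import numpy as np
from scipy.linalg import null_space

def interaction(points, z=1.0):
    """Stacked interaction matrix (Eq. 2.2) for coplanar points at depth z."""
    rows = []
    for X, Y in points:
        rows.append([-1/z, 0, X/z, X*Y, -(1 + X**2), Y])
        rows.append([0, -1/z, Y/z, 1 + Y**2, -X*Y, -X])
    return np.array(rows)

square = [(0.1, 0.1), (-0.1, 0.1), (-0.1, -0.1), (0.1, -0.1)]
Js = interaction(square)

# Ker Js^T is 2-dimensional: its vectors are image motions (one (dX, dY) pair
# per point) that no camera velocity screw T can produce.
unrealizable = null_space(Js.T)
print(unrealizable.shape)                     # (8, 2)
print(np.allclose(Js.T @ unrealizable, 0))    # True
```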
The link between local minima and unrealizable image motions is obvious since we always have Ker Js^T ⊆ Ker J+. Indeed, if the camera reaches a position such that (s - s*) ∈ Ker Js^T, the behavior of the control law using Js(s(t), ẑ(t)) implies that ṡ must also belong to Ker Js^T. By definition, there does not exist any camera motion T able to produce such imposed, but unrealizable, image motions. The camera is thus in a local minimum. This can easily be demonstrated since, if we assume perfect measurements and estimation, we here have J+ = Js+, which of course implies that, if (s - s*) ∈ Ker Js^T, then (s - s*) ∈ Ker J+. On the other hand, using Js(s*, z*) generally implies that Ker Js^T ≠ Ker J+, which allows the system to avoid local minima if it computes image motions which are not unrealizable. In fact, one has to demonstrate that Ker Js^T ∩ Ker J+ = 0, or more precisely that the image configurations (s - s*) which belong to Ker Js^T ∩ Ker J+ are not physically coherent. For example, it does not seem that any camera pose exists which yields the gray image configuration depicted in Figure 2.6 for the considered square object.
Let us claim that we have shown in this subsection that it is sometimes more interesting to use Js(s*, z*) instead of Js(s(t), ẑ(t)), but nothing more. Finally, we will see in the conclusion of the paper several simple sufficient conditions for avoiding such undesirable behaviors of image-based visual servoing.
3. Position-based visual servoing
Position-based visual servoing does not present the drawbacks described above. An estimate r̂ of the camera pose r is now computed at each iteration of the control law [12, 4]. Control schemes are thus designed so that r̂ reaches a desired pose r* [29, 18]. Since the control schemes are based on inputs expressed in Cartesian space, this approach seems to be very satisfactory. Indeed, ideally, the task Jacobian J_r̂ = ∂r̂/∂r is given by:

$$ J_{\hat{r}} = \begin{pmatrix} I_3 & A_s(U) \\ 0 & I_3 \end{pmatrix} \qquad (3.1) $$

where A_s(U) is the antisymmetric matrix associated with the vector U defined by the origins of the target and camera frames. Ideally, J_r̂ is thus an upper triangular matrix (which ensures a perfect decoupling of the camera trajectory) that is always of full rank 6 and such that Ker J_r̂^T = Ker J_r̂^+ = 0. However, the form of J_r̂ given by (3.1) is only valid when a perfect camera calibration, perfect image measurements, a perfect 3D model of the object, and a perfect pose estimation algorithm are available... In fact, the real form of the task Jacobian is:

$$ J_{\hat{r}} = \frac{\partial \hat{r}}{\partial s}\,\frac{\partial s}{\partial r} \qquad (3.2) $$

The second term, ∂s/∂r, is nothing but the interaction matrix related to s (see (2.1)). It is thus analytically known. On the other hand, ∂r̂/∂s, which represents the variation of the estimated pose as a function of a variation of the visual features, is unfortunately unknown, and depends on the visual features used, the shape of the object, the camera calibration parameters, and the pose estimation algorithm used (generally based on iterative numerical methods [4]). Theoretically demonstrating the global asymptotic stability of position-based visual servoing therefore seems to be out of reach, since a sufficient condition is given by (refer to (2.4)):

$$ \widehat{J}_{\hat{r}}^{+}\, J_{\hat{r}} > 0, \quad \forall t \qquad (3.3) $$

where Ĵ_r̂^+, used in the control scheme, can be chosen as (3.1) (it seems to be the only possible satisfactory approximation), and where J_r̂ is, as already stated, analytically unknown.
Furthermore, since there is no longer any control in the image, it is impossible to ensure that the object will always remain in the camera field of view during the servoing, especially in the presence of large calibration errors. It is also well known that the Perspective from N Points problem, like most inverse problems, is sometimes ill-posed and sensitive to perturbations (which implies poor conditioning of ∂r̂/∂s): small errors in the image measurements may lead to very different results, especially in the case of a planar target, as can be seen in Figure 2.4. In such cases, the control law can be completely unstable. In practice, satisfactory results are obtained if a non-coplanar object is considered [29, 18]. However, it would be very interesting to determine the number, the nature, and the configuration of the features necessary to obtain an optimal behavior of the pose estimation. To conclude, in position-based visual servoing, most of the control problems are transferred to the pose estimation algorithm. However, that does not close the whole problem of integrating vision and control.
4. Conclusion

As exhibited in this paper, numerous problems stand in the way of theoretically demonstrating the complete efficiency of any visual servoing scheme. As for image-based visual servoing, an interesting challenge is to determine a universal representation of visual features adequate for any object and from any initial camera position. It is known that the condition number of the image Jacobian plays an important role in the behavior of the system [7] (see also [20, 25] where similar measures are presented to select adequate visual features). More precisely, the condition number of Js, and also of J+, has to be minimal. From the results obtained in this paper, supplementary sufficient conditions to ensure a correct modeling of an image-based task are that Ker J+ = Ker Js^T = 0 over all of the possible workspace (which also excludes the existence of task singularities). Intuitively, the first six inertial moments of the image of an object seem to be interesting candidates, even if observing a centered circle with such a representation (and, unfortunately, with any other) is known to be a singular case [2]. The 2D 1/2 visual servoing scheme that has been recently proposed [16] also seems to be an interesting approach to avoiding convergence and stability problems.
References
1. P. Allen, A. Timcenko, B. Yoshimi, and P. Michelman. Automated tracking and grasping of a moving object with a robotic hand-eye system. IEEE Trans. on Robotics and Automation, 9(2):152-165, April 1993.
2. F. Chaumette, S. Boukir, P. Bouthemy, and D. Juvin. Structure from controlled motion. IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(5):492-504, May 1996.
3. F. Chaumette, P. Rives, and B. Espiau. Classification and realization of the different vision-based tasks. In K. Hashimoto, editor, Visual Servoing, pages 199-228, World Scientific, Singapore, 1993.
4. D. Dementhon and L. Davis. Model-based object pose in 25 lines of code. Int. Journal of Computer Vision, 15(1/2):123-141, June 1995.
5. B. Espiau. Effect of camera calibration errors on visual servoing in robotics. In Third Int. Symposium on Experimental Robotics, Kyoto, Japan, October 1993.
6. B. Espiau, F. Chaumette, and P. Rives. A new approach to visual servoing in robotics. IEEE Trans. on Robotics and Automation, 8(3):313-326, June 1992.
7. J. Feddema, C. Lee, and O. Mitchell. Automatic selection of image features for visual servoing of a robot manipulator. In IEEE Int. Conf. on Robotics and Automation, volume 2, pages 832-837, Scottsdale, Arizona, May 1989.
8. J. Feddema and O. Mitchell. Vision-guided servoing with feature-based trajectory generation. IEEE Trans. on Robotics and Automation, 5(5):691-700, October 1989.
9. G. Hager. A modular system for robust positioning using feedback from stereo vision. IEEE Trans. on Robotics and Automation, 13(4):582-595, August 1997.
10. K. Hashimoto, editor. Visual Servoing. World Scientific Series in Robotics and Automated Systems, Vol 7, World Scientific Press, Singapore, 1993.
11. K. Hashimoto and H. Kimura. Dynamic visual servoing with nonlinear model-based control. In 12th World Congress IFAC, volume 9, pages 405-408, Sydney, Australia, July 1993.
12. R. Horaud. New methods for matching 3d objects with single perspective view. IEEE Trans. on Pattern Analysis and Machine Intelligence, 9(3):401-412, May 1987.
13. K. Hosoda and M. Asada. Versatile visual servoing without knowledge of true jacobian. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pages 186-193, Munchen, Germany, September 1994.
14. S. Hutchinson, G. Hager, and P. Corke. A tutorial on visual servo control. IEEE Trans. on Robotics and Automation, 12(5):651-670, October 1996.
15. M. Jagersand, O. Fuentes, and R. Nelson. Experimental evaluation of uncalibrated visual servoing for precision manipulation. In IEEE Int. Conf. on Robotics and Automation, volume 3, pages 2874-2880, Albuquerque, New Mexico, April 1997.
16. E. Malis, F. Chaumette, and S. Boudet. Positioning a coarse-calibrated camera with respect to an unknown object by 2d 1/2 visual servoing. In IEEE Int. Conf. on Robotics and Automation, Leuven, Belgium, May 1998. (Extended version available at ftp://ftp.irisa.fr/techreports/1998/PI-1166.ps.gz).
17. E. Marchand, F. Chaumette, and A. Rizzo. Using the task function approach to avoid robot joint limits and kinematic singularities in visual servoing. In IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, volume 3, pages 1083-1090, Osaka, Japan, November 1996.
18. P. Martinet, N. Daucher, J. Gallice, and M. Dhome. Robot control using monocular pose estimation. In Workshop on New Trends in Image-based Robot Servoing, IROS'97, pages 1-12, Grenoble, France, September 1997.
19. H. Michel and P. Rives. Singularities in the determination of the situation of a robot effector from the perspective view of three points. Technical Report 1850, INRIA, February 1993.
20. B. Nelson and P. Khosla. The resolvability ellipsoid for visual servoing. In IEEE Int. Conf. on Computer Vision and Pattern Recognition, pages 829-832, Seattle, Washington, June 1994.
21. B. Nelson and P. Khosla. Strategies for increasing the tracking region of an eye-in-hand system by singularity and joint limits avoidance. Int. Journal of Robotics Research, 14(3):255-269, June 1995.
22. N. Papanikolopoulos. Selection of features and evaluation of visual measurements during robotic visual servoing tasks. Journal of Intelligent and Robotics Systems, 13:279-304, 1995.
23. N. Papanikolopoulos, P. Khosla, and T. Kanade. Visual tracking of a moving target by a camera mounted on a robot: A combination of control and vision. IEEE Trans. on Robotics and Automation, 9(1):14-35, February 1993.
24. C. Samson, M. Le Borgne, and B. Espiau. Robot Control: the Task Function Approach. Clarendon Press, Oxford, United Kingdom, 1991.
25. R. Sharma and S. Hutchinson. Optimizing hand/eye configuration for visual servo systems. In IEEE Int. Conf. on Robotics and Automation, pages 172-177, Nagoya, Japan, May 1995.
26. C. Smith and N. Papanikolopoulos. Computation of shape through controlled active exploration. In IEEE Int. Conf. on Robotics and Automation, volume 3, pages 2516-2521, San Diego, California, May 1994.
27. I. Suh. Visual servoing of robot manipulators by fuzzy membership function based neural networks. In K. Hashimoto, editor, Visual Servoing, pages 285-315, World Scientific, Singapore, 1993.
28. L. Weiss, A. Sanderson, and C. Neuman. Dynamic sensor-based control of robots with visual feedback. IEEE Journal of Robotics and Automation, 3(5):404-417, October 1987.
29. W. Wilson, C. Hulls, and G. Bell. Relative end-effector control using cartesian position-based visual servoing. IEEE Trans. on Robotics and Automation, 12(5):684-696, October 1996.
What Can Be Done with an Uncalibrated Stereo System?

João Hespanha, Zachary Dodds, Gregory D. Hager, and A.S. Morse

Center for Computational Vision and Control, Departments of Electrical Engineering and Computer Science, Yale University, New Haven, CT 06520, USA
Summary. Over the last several years, there has been an increasing appreciation of the impact of control architecture on the accuracy of visual servoing systems. In particular, it is generally acknowledged that so-called image-based methods provide the highest guarantees of accuracy on inaccurately calibrated hand-eye systems. Less clear is the impact of the control architecture on the set of tasks which the system can perform. In this article, we present a formal analysis of control architectures for hand-eye coordination. Specifically, we first state a formal characterization of what makes a task performable under three possible encoding methods. Then, for the specific case of cameras modeled using projective geometry, we relate this characterization to notions of projective invariance and demonstrate the limits of achievable performance in this regard.
1. Introduction

In the design of vision-based control systems, two families of control architectures dominate [4, 11]. The first family, characterized by the position-based approach, computes feedback from estimates of geometric quantities (pose, feature position, etc.) in the Cartesian space of the robot. The second family, the image-based approach, computes feedback directly from image features. We now restrict our attention to a two-camera system observing both controlled features (e.g. those on a robot manipulator) and a (static) target. It has been observed that there exist approaches which can achieve precise task positioning, even though the relationship between the manipulator and the observing cameras may not be precisely known [11, 8, 12]. In particular, prior work has shown that one route toward defining tasks performable with absolute precision is to take advantage of the invariance of certain image-level constructions to perspective projection ([7, 9, 2]). It is also known that there are tasks which, under certain conditions, are performable using image-based methods and not performable using position-based methods [8]. In [6] Faugeras specifies the extent to which the world is reconstructible using a pair of uncalibrated cameras. It is shown that with a "weakly calibrated" stereo rig, i.e., one in which the epipolar geometry of the cameras is known, reconstruction can be performed up to a projective transformation.
This result suggests an additional line of investigation, namely: to what extent does a weakly calibrated rig affect task performability?¹ Such observations motivate the central question of this article: what tasks
are precisely performable using an imprecisely calibrated pair of cameras? More specifically, is the general set of tasks performable under image-based control strictly larger than for position-based control? Can one precisely characterize the limits of performability? How is the notion of performability related to projective invariance? In this article we provide the following answers to the above questions. First, we establish a formal characterization of the set of performable, or testable, tasks and show that this is exactly the set of tasks performable by image-based methods for both uncalibrated and weakly calibrated systems. We also provide necessary and sufficient conditions for a task to be testable with a weakly calibrated vision system. In particular, the assertion that a task is testable under weak calibration is equivalent to the statement that the task is invariant under projective transformations. Further, projective invariance is necessary, but not sufficient, to ensure that a task is testable with an uncalibrated vision system. Examples of projectively invariant tasks which are not testable in the uncalibrated case are demonstrated. These results bridge the classical computer-vision notions of projective geometry with sensor-based strategies from control theory; they provide a concise characterization of the limits of performability for vision-based motion control architectures. The remainder of this paper is organized as follows. In the next section, we define and discuss the notion of task encoding and derive results regarding the performability of tasks with absolute precision. Following that, we focus on the class of image-based tasks and show that this set is maximal. In Section 4 we relate performability to invariance over projective transformations. Section 5 discusses some of the limitations of this formalization. We conclude with some observations and possible directions for future research.
Notation: Throughout this article ℙ³ denotes the 3-dimensional projective space over the field of real numbers ℝ. We recall that an element or projective point of ℙ³ is a one-dimensional vector subspace of ℝ⁴ and that a projective line in ℙ³ is a two-dimensional vector subspace of ℝ⁴. For each vector x ∈ ℝ⁴ \ {0} we denote by x̄ the corresponding projective point. Given two distinct projective points f₁, f₂ ∈ ℙ³ we denote by f₁ ⊕ f₂ the projective line defined by f₁ and f₂. Ker(T), the kernel of the function T, is the set of points x in the domain of T such that T(x) = 0. For general sets, we distinguish strict set inclusion from nonstrict inclusion by using ⊂ and ⊆, respectively. We also denote the n-fold Cartesian product of a set S as Sⁿ = S × S × ... × S (n times).
1 Radu Horaud independently suggested this question to one of the authors at the 1995 ICCV conference.
2. Task Encodings

We consider specifically the problem of controlling the pose x of a robot that moves in a prescribed workspace W ⊂ SE(3) using two cameras. The data to accomplish this task consist of the projections of point-features attached to the robot as well as point-features attached to the environment that appear in the two-camera field of view V. Typically V is a subset of ℝ³ or ℙ³. Point-features are mapped into the two-camera image space Y through a fixed but imprecisely known two-camera model C_actual : V → Y, where Y is a subset of ℝ² × ℝ² or ℙ² × ℙ². If f denotes a vector in V, then the feature position in image space is a measured output vector y ∈ Y related to f by the readout formula y = C_actual(f). The two-camera model C_actual is a fixed but unknown element of a prescribed set of functions C which map V into Y. In the sequel C is called the set of admissible two-camera models. Throughout this article, we assume that all elements of C are injective, and therefore left-invertible, over V.

2.1 Tasks
A typical problem or task consists of driving the pose of a robot to a "target" (or set) in W. Both the pose of the robot and the target are determined by a list of simultaneously observed point-features in V. Following [11, 8], we opt for an implicit representation of a task in the spirit of [14] and define a task function to be any function T which maps an ordered set (or list) of n simultaneously observed point-features f ≜ {f1, f2, ..., fn} (fi ∈ V) into a real scalar. In some cases only certain sets of features are allowed (e.g., for n = 3 the three point-features may be required to be collinear). The set F ⊆ Vⁿ of all such lists of interest is called the feature space. The task is to control the pose of a robot in W so as to achieve the condition

$$ T(f) = 0. \qquad (2.1) $$
Examples of tasks expressed in this form can be found in [14, 2, 3, 2, 11, 8, 13]. The problem addressed in this paper is to determine conditions under which it is possible to verify that (2.1) holds using only the data that can be sensed by the two cameras. These data are y ≜ {y1, y2, ..., yn} with yi = C_actual(fi), subject to the assumption that C_actual is in C but is otherwise unknown. For compactness of notation, in the remainder of this article we define for each C in C the function C̄ from F to the set Yⁿ by the rule

$$ \{f_1, f_2, \ldots, f_n\} \mapsto \{C(f_1), C(f_2), \ldots, C(f_n)\} $$

Therefore the data that can be sensed by the two cameras is simply

$$ y = \overline{C}_{\mathrm{actual}}(f) \qquad (2.2) $$
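To fix ideas, the readout map can be sketched as follows (not from the article): the two-camera model is represented, as in Section 4.1 below, by a pair of 3x4 projection matrices, and the particular matrices and feature points are invented for illustration.

```python
import numpy as np

# A projective two-camera model C = {<M>, <N>}: each camera maps a homogeneous
# 3-D point to image coordinates. The matrices below are illustrative only.
M = np.array([[800.0, 0, 320, 0], [0, 800.0, 240, 0], [0, 0, 1, 0]])
N = np.array([[800.0, 0, 320, -80],      # second camera displaced along x
              [0, 800.0, 240, 0], [0, 0, 1, 0]])

def project(P, X):
    """Map a homogeneous point X in P^3 to inhomogeneous image coordinates."""
    u = P @ X
    return u[:2] / u[2]

def C(f):
    """Readout y = C(f) for a single point-feature f (homogeneous 4-vector)."""
    return (project(M, f), project(N, f))

def C_bar(features):
    """The list version y = C_bar(f) of Eq. (2.2)."""
    return [C(f) for f in features]

f = [np.array([0.1, -0.2, 2.0, 1.0]), np.array([0.0, 0.3, 2.5, 1.0])]
print(C_bar(f))
```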
2.2 Task Encoding
Since we presume that Cactual is not known with precision, the exact locations of the points fi cannot be determined from observed data. Therefore it is unclear whether a given task expressed in the form of (2.1) can be accomplished. In this section we discuss three "task encodings" for implementing tasks in the form of (2.1). Each task encoding is a paradigm for expressing task constraints based on available information, i.e., a task encoding explicitly connects a system's goal, the task, with its resources, the available visual data. We then state necessary and sufficient conditions on the admissible set of camera models such that satisfying each of the encodings is equivalent to accomplishing the original task (2.1).
Cartesian-Based Approach. This approach, often referred to as "position-based" in the visual servoing literature, is motivated by the intuitive idea of "certainty equivalence." In the present context, certainty equivalence advocates that one should use estimates of f to accomplish task (2.1), with the understanding that these estimates are to be viewed as correct even though they may not be. The construction of such estimates starts with the selection (by some means) of a two-camera model C_estim in C which is considered to be an approximate model of C_actual. Assume that such a C_estim has been chosen. Using C_estim, we estimate f using a left inverse C̄_estim⁻¹ of C̄_estim. Such a left inverse exists because C_estim is injective. In view of (2.2), it is natural to define

$$ \overline{C}_{\mathrm{estim}}^{-1}(y) \qquad (2.3) $$

as the estimate of f to be considered. In accordance with certainty equivalence, we can then seek to accomplish the re-encoded task

$$ T\!\left(\overline{C}_{\mathrm{estim}}^{-1}(y)\right) = 0. \qquad (2.4) $$

Condition (2.4) expresses the Cartesian-based encoding of the original task (2.1). We would like to determine those cases when accomplishing the re-encoded task (2.4) is equivalent to accomplishing (2.1). The following lemma provides an answer.

Lemma 2.1. For a fixed set of admissible models C, define the set of Cartesian-based tasks, CB_C, to be the set of all task functions T for which there exists a left inverse C̄_estim⁻¹ such that for all C_actual ∈ C,

$$ \operatorname{Ker} T = \operatorname{Ker}\!\left(T \circ \overline{C}_{\mathrm{estim}}^{-1} \circ \overline{C}_{\mathrm{actual}}\right). \qquad (2.5) $$

Then, for each f ∈ F and C_actual ∈ C, equations (2.4) and (2.2) are equivalent to (2.1) if and only if T ∈ CB_C. The details of the proofs of Lemma 2.1 and subsequent lemmas are provided in [10].
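A sketch of the Cartesian-based encoding follows (not from the article): the left inverse C̄_estim⁻¹ is implemented here as linear triangulation with an estimated camera pair, and the task tested (driving two features into coincidence), like the matrices, is an illustrative assumption.

```python
import numpy as np

# Estimated camera pair (deliberately different from any "actual" cameras).
M_est = np.array([[790.0, 0, 318, 0], [0, 790.0, 242, 0], [0, 0, 1, 0]])
N_est = np.array([[790.0, 0, 318, -75], [0, 790.0, 242, 0], [0, 0, 1, 0]])

def triangulate(y_pair):
    """Linear (DLT) left inverse of the estimated two-camera model."""
    (u1, v1), (u2, v2) = y_pair
    A = np.vstack([u1 * M_est[2] - M_est[0], v1 * M_est[2] - M_est[1],
                   u2 * N_est[2] - N_est[0], v2 * N_est[2] - N_est[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def T(points):
    """Example task: drive two reconstructed features into coincidence."""
    return np.linalg.norm(points[0] - points[1])

def cartesian_encoding(y):
    """The re-encoded task (2.4): T applied to the reconstructed feature list."""
    return T([triangulate(p) for p in y])

# y would come from the (unknown) actual cameras; here we only show the call.
y = [((330.0, 250.0), (300.0, 250.0)), ((335.0, 255.0), (305.0, 255.0))]
print(cartesian_encoding(y))
```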
2.2.1 Image-Based Approach. Whereas the Cartesian-based encoding uses reconstructed estimates of points in V to implement tasks, a task can also be expressed solely in the space of the sensor. Such an encoding is known as image-based. This well-known approach requires that there exist a function T̂ : Yⁿ → ℝ such that

$$ \operatorname{Ker} T = \operatorname{Ker}(\hat{T} \circ \overline{C}), \quad \forall C \in C \qquad (2.6) $$

We assume that (2.6) is satisfied by some computable function T̂. The image-based approach seeks to accomplish

$$ \hat{T}(y) = 0. \qquad (2.7) $$
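As an example of such an encoding (a sketch, not from the article): for the collinearity task T_col mentioned below, and for projective cameras away from degenerate epipolar configurations, three point-features are collinear in space exactly when their projections are collinear in both images, so a natural T̂ sums the two image-plane collinearity residuals.

```python
import numpy as np

def collinear_residual(p, q, r):
    """Twice the area of the image triangle (p, q, r); zero iff the points are collinear."""
    return abs((q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0]))

def T_hat(y):
    """Image-based encoding of T_col; y lists (image1_point, image2_point) pairs."""
    img1 = [pair[0] for pair in y]
    img2 = [pair[1] for pair in y]
    return collinear_residual(*img1) + collinear_residual(*img2)

# T_hat(y) is computable without knowing which admissible cameras produced y.
y = [((100.0, 100.0), (90.0, 100.0)),
     ((150.0, 120.0), (140.0, 118.0)),
     ((200.0, 140.0), (190.0, 136.0))]
print(T_hat(y))
```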
The approach is justified by the following lemma.

Lemma 2.2. For a fixed set of camera models C, define the set of image-based tasks, IB_C, to be the set of all tasks T such that (2.6) holds for some T̂. Then for each f ∈ F and C_actual ∈ C, equations (2.7) and (2.2) are equivalent to (2.1) if and only if T ∈ IB_C.

As we will see, e.g., for projective camera models, demanding that (2.6) holds turns out to be much less severe than requiring that property (2.5) holds. For example, for a suitably restricted set of camera models C' it is possible to demonstrate that T_col ∈ IB_C'. In a sense to be specified in the next section, (2.6) is actually a necessary condition for a problem to be solvable, and hence IB_C is in this sense the largest set of tasks that can be performed (relative to a given set of admissible camera models C) using cameras as sensors. Note that, although the image-based approach does not require the selection of a candidate two-camera model C_estim in C to serve as an estimate of C_actual, in practice designing a controller that accomplishes (2.7) will typically require such an estimate. It is important to realize, however, that the choice of camera model affects the dynamics (in particular the stability) of the closed-loop system, but not its accuracy [8].

2.2.2 Modified Cartesian-Based Approach. An image-based task encoding, as shown in the next section, is a maximally powerful encoding. The Cartesian-based encoding offers the advantage of expressing system control in three-dimensional Euclidean space, the natural choice for specifying robot tasks. The modified Cartesian-based encoding, introduced in [1], is essentially a compromise between these two approaches. A modified Cartesian-based encoding resembles the Cartesian-based one in that it uses the left inverse of a camera model estimate, C̄_estim⁻¹, to obtain approximated positions of image points. Similar to the image-based encoding, the modified Cartesian-based paradigm also depends on the existence of two functions, T' : F → ℝ and R : Yⁿ → Yⁿ, such that

$$ \operatorname{Ker} T = \operatorname{Ker}\!\left(T' \circ \overline{C}_{\mathrm{estim}}^{-1} \circ R \circ \overline{C}\right), \quad \forall C \in C. \qquad (2.8) $$
Informally, R can be thought of as an arbitrary image-based construction of some or all of the observed points. That construction, perhaps along with the rest of the observed points, is reconstructed in Cartesian space. There, the resulting points are considered correctly configured when they lie in the kernel of T'. With the necessary functions in place, the modified Cartesian-based encoding seeks to achieve

$$ T'\!\left(\overline{C}_{\mathrm{estim}}^{-1}(R(y))\right) = 0. \qquad (2.9) $$
As before, the following lemma defines the set of tasks performable with a modified Cartesian-based task encoding.

Lemma 2.3. For a fixed set of camera models C, define the set of modified Cartesian-based tasks, MCB_C, to be the set of all tasks T for which there exists a left inverse C̄_estim⁻¹ such that (2.8) holds for some T' and R. Then for each f ∈ F and C_actual ∈ C, equations (2.9) and (2.2) are equivalent to (2.1) if and only if T ∈ MCB_C.

With the definitions above we can immediately state some relationships among the sets of performable tasks:

Lemma 2.4. For fixed C, CB_C ⊆ MCB_C ⊆ IB_C.

Setting R to the identity and T' to T in (2.9) suffices to show the first set inclusion. The second follows by making T̂ equal to T' ∘ C̄_estim⁻¹ ∘ R in (2.6).
3. Testable Task Functions
In the previous section we discussed methods to re-encode a task specified in the form of (2.1) using only the data that can be sensed by two cameras. In this section we derive a minimum set of assumptions under which it is possible to verify whether condition (2.1) holds using the sensed data y = C̄_actual(f), even though C_actual is an unknown element of C. In the sequel we define a task function T : F → ℝ to be testable on C if there exists an algorithm A which, for each two-camera model C ∈ C and each f ∈ F, is able to determine whether or not T(f) = 0 from the measured data y = C̄(f). This notion of testability is meant to capture the idea of task performability without regard to the details of a control procedure. We proceed by deriving necessary and sufficient conditions characterizing those task functions which are testable.

Lemma 3.1 (Testability). The following statements are equivalent:

1. The task function T : F → ℝ is testable on C.
2. There exists a map T̂ : Yⁿ → ℝ such that

$$ \operatorname{Ker} T = \operatorname{Ker}(\hat{T} \circ \overline{C}), \quad \forall C \in C. \qquad (3.1) $$

If T is testable on C, and T̂ is as in (3.1), an algorithm that can be used to determine if T(f) = 0 is: if T̂(y) = 0 then T(f) = 0, else T(f) ≠ 0. Because of this, we follow [1] in saying that T̂(y) = 0 is an encoding of the task T(f) = 0. Note that the equivalence of conditions 1 and 2 implies that the set of testable tasks is exactly the set of image-based tasks, and hence IB_C contains all tasks which can be performed for a given set of admissible camera models. We also note the following about the continuity of the encoding function T̂. Suppose that Y is a compact subset of a finite-dimensional space, T is a continuous function, and C is of the form C = {C_p : p ∈ P} with P a compact subset of a finite-dimensional space. If (p, f) ↦ C_p(f) is a function continuous on P × V, then, without loss of generality, the function T̂ in statement 2 of Lemma 3.1 can be assumed continuous [10].
4. Performability for Projective Systems
Systems
Until now, all results pertaining to the performability of positioning tasks were true for any set of admissible c a m e r a models with the a p p r o p r i a t e structure. In this section, we focus specifically on cameras described by projective models. For these cameras, we assume a field of view ~2 which is a subset of lP 3 of the form
where ]2• is a bounded subset of ]R 3 and the image space y is ]p2 • ]p2. Notation: We denote by GL(4) the general linear group of nonsingular 4 • 4 real matrices. Given a matrix A" E GL(4) we denote by (A) the action of A on IP 3 which is a function from IP 3 to ]p3 defined by the rule lRx ~-+ IRAx. T h e function (A) is called a projective transformation. Similarly, given a 3 • 4 full rank m a t r i x A, we denote by (A) the function from IP 3 \ KerA to IP 2 defined by the rule lRx ~ lRAx.
4.1 Projective Camera Models

Given two 3 × 4 full-rank matrices M, N such that Ker M and Ker N are not projective points in V, we say that C is the projective two-camera model defined by the pair {M, N} and write C = {⟨M⟩, ⟨N⟩} if C is the map from V to Y defined by the rule
$$ f \mapsto \{\langle M\rangle(f),\ \langle N\rangle(f)\} $$
The projective points defined by Ker M and Ker N are the optical centers of the cameras, and the projective line defined by these projective points, i.e., Ker M ⊕ Ker N, is the baseline of the cameras. Our requirement of injectivity is that the optical centers of the two cameras must be distinct and that the cameras' baseline does not intersect V. In particular, we consider two sets of admissible two-camera models. The first contains uncalibrated cameras; the second, weakly calibrated rigs.
Uncalibrated projective two-camera system. Let Cuc denote the set of every projective two-camera model C -- ( ( M ) , ( N ) ) defined by any pair of full rank matrices M, N E ]R3x4 such that the cameras' centers are distinct and their baseline does not intersect )2. This situation is often referred to as an uncalibrated projective two-camera system. Weakly calibrated projective two-camera system. Let Co -- {(M0), (No)} be some element of Cuc. We denote by Cwk(C0) the subset of Cuc consisting of all those camera models in Cuc which have the same epipolar geometry as Co. The elements of Cwk(Co) are projective two-camera models C -- {(MoA), (NoA)} where A can be any nonsingular matrix such that the cameras' centers are distinct and their baseline does not intersect 1;. This situation is often referred to as a weakly calibrated projective two-camera system. Clearly, Cwk(C0) C Cuc for each Co E Cuc. For the rest of the paper we assume t h a t a suitable epipolar geometry is given and fixed and write Cwk without the argument Co. 4.2 T a s k I n v a r i a n c e We further characterize the set of testable tasks by introducing the notion of invariant tasks: D e f i n i t i o n 4.1. We say a task function T : jr __~ ]R is invariant under projective transformations if for every nonsingular matrix A E GL(4) and all length-n lists of features ], g E Jr such that
(A)(fi) = gi,
i = 1,2,...,n
it is true that T ( f ) = 0 if and only if T(g) = O. Let P I denote the set of all task functions which are invariant under projective transformations. With these definitions in place, it is possible to prove the following statements: L e m m a 4.1 ( T e s t a b i l i t y o n Cwk). A task function T is testable on Cwk if and only if T E P I , i.e., T is invariant under projective transformations. L e m m a 4.2 ( T e s t a b i l i t y o n Cuc). A necessary condition for a task ]unction T to be testable on Cuc is that T E P I .
What Can Be Done with an Uncalibrated Stereo System?
87
Taken together, these results can be summarized as follows: T h e o r e m 4.1. For sets of camera models and tasks as defined above: IBcuo C P I = IBcwk Strictness of the first inclusion follows from the example task of making four point features coplanar: that is, consider a task function Tcp which takes four points as input and which equals zero only when those four points are coplanar. The fact that Top E IBcwk is shown in [6]. The fact that Top ~ IBcur is shown by constructing an example of two admissible camera models C and C' and two sets of four points, f coplanar and f~ not coplanar, such that C'(f) = 0 ' ( f ' ) [10]. In short, a weakly calibrated camera system can perform all (point-based) tasks which are invariant to projective transformations. As noted above, this set also defines the upper limit of performable tasks. Uncalibrated cameras can perform a strictly smaller set of tasks. In light of Lemma 4.1, we know that the previously mentioned coplanarity task, Tcp, is an example of a projectively invariant task which is not in IBc~o.
5. S o m e
Limitations
of the Formulation
Surprisingly, to date we know of no example which differentiates IBcuc and CBcuo within the framework presented in this paper, although in practice we have implemented several tasks which can be performed with absolute accuracy only by image-based methods [8]. As hinted above, this incongruity has to do with the richness of the set of camera models and feature lists. In practice, the set of camera models is smaller than allowed here - in particular, visual singularities are removed from the set of admissible camera models. However, in the general case presented here such singularities make tasks such as Tcol impossible to reconcile with the requirement that an imagebased encoding must be equivalent to its underlying task for every possible camera pair and feature list. The distinctions among the three task encodings also need to be tightened in two ways: both as defined here and for more specific cameras and feature lists. If we suitably restrict the viewspace 12 and the set of camera models to remove "visual singularities", it is possible to separate IBcuc from CBc~c, (e.g., Tco0 as well as MCBcuo from CBc~o. A task Tql which separates the latter two sets is one in which a point is to be positioned at the center (i.e., the intersection of the diagonals) of a planar quadrilateral whose corners are observed. For this task the ability to construct the unseen goal point in the image allows precise positioning via M C B but differing reconstructive errors do not allow precision with C B [1]. In all of these cases, however, the additional restrictions on admissible cameras and usable feature points expand the set of performable tasks. This
88
Jo~o Hespanha, Zachary Dodds, Gregory D. Hager, and A.S. Morse
expansion may introduce performable tasks which are not in P I . The formalism currently lacks a systematic way to reduce the features and cameras under consideration. Such a system would allow us to specify if and when the links in the chains of Lemma 2.4 merge into equivalent sets. Most fundamentally, we do not, to date, know of a general way of suitably weakening the definition of performability in order to explore the structure of the task space framed in this work.
6. C o n c l u s i o n s The main results of this paper can be briefly summarized as follows: - A task is performable with absolute accuracy with weakly calibrated cameras if and only if it is invariant to projective transformations. This set of tasks is therefore maximal for weakly calibrated cameras. - If a task is performable with absolute accuracy with uncalibrated cameras for which no epipolar information can be computed, then it is invariant to projective transformations. This set of tasks is strictly smaller t h a n t h a t for weakly calibrated cameras. - The set of tasks which an image-based encoding can perform is maximal among the three encoding approaches presented here. In addition, imagebased tasks are, in fact, everything t h a t can be done only using measured data. It is unclear at this point if in general IBr is strictly larger than the set of tasks performable using Cartesian (position-based) or modified Cartesian methods. Section 5. provides an outline of the future directions we plan with this work. Up to now, the set of cameras considered admissible is, in some sense, too large. It covers many camera configurations that can be safely assumed not to occur in practice. We are working to extend the results to draw finer distinctions among the hierarchy of task spaces by diminishing the size of C. Additional complications arise on this path, as the sets of camera configurations which we would like to exclude often depend on the positions of the feature points used or the specification of the task itself. The use of points as the basis for task specification simplifies the notation of encodings, but more general features have been used in [1]. The set of point-to-point tasks is a rich one, as it encompasses many of the examples found in the literature. In addition, tasks involving lines and conics can often be constructed from observed point features, e.g., Tool and Tql. Our results seek to provide insight into what tasks are performable using the information provided by uncalibrated stereo. In reality, tasks lie along a spectrum of performability, where errors due to noise, the control algorithm, and sensor singularities contribute to imprecise execution. This paper's framework classifies tasks as belonging (or not belonging) to a particular set. The
What Can Be Done with an Uncalibrated Stereo System? notion of "distance" pelling extension to whether continuous tives by guiding the tasks.
89
from a particular set, for example, would provide a comthe binary relationship of inclusion. Such taxonomies, or discrete, assist human or automated system execuchoice of an approach for accomplishing a certain set of
Acknowledgement. This work was supported by National Science Foundation grant IRI-9420982, and by funds provided by Yale University. Thanks to Radu Horaud and David Kriegman for providing valuable encouragement to pursue this problem. This paper originally appeared in the 1998 IEEE Int. Conf. on Robotics and Automation.
References 1. W-C. Chang, J. P. Hespanha, A.S. Morse, and G.D. Hager. Task re-encoding in vision-based control systems. In Proceedings, Conference on Design and Control, volume to appear, San Diego, CA, December 1997. 2. F. Chanmette, E. Malis, and S. Boudet. 2d 1/2 visual servoing with respect to a planar object. In Proc. IROS Workshop on New Trends in Image-based Robot Servoing, pages 43-52, 1997. 3. F. Chaumette, P. Rives, and B. Espiau. Classification and realization of the different vision-based tasks. In K. Hashimoto, editor, Visual Servoing, pages 199-228. World Scientific, 1994. 4. P. I. Corke. Visual control of robot manipulators--a review. In K. Hashimoto, editor, Visual Servoing, pages 1-32. World Scientific, 1994. 5. B. Espiau, F. Chanmette, and P. Rives. A New Approach to Visual Servoing in Robotics. IEEE Trans. on Robotics and Automation, 8:313-326, 1992. 6. Olivier D. Faugeras. What can be seen in three dimensions with an uncalibrated stereo rig? In Proc., ECCV, pages 563-578, 1992. 7. G. D. Hager. Real-time feature tracking and projective invariance as a basis for hand-eye coordination. In Proc. IEEE Conf. Comp. Vision and Part. Recog., pages 533-539. IEEE Computer Society Press, 1994. 8. G. D. Hager. A modular system for robust hand-eye coordination. IEEE Trans. Robot. Automat, 13(4):582-595, 1997. 9. G. D. Hager and Z. Dodds. A projective framework for constructing accurate hand-eye systems. In Proc. IROS Workshop on New Trends in Image-based Robot Servoing, pages 71-82, 1997. 10. J. Hespanha, Z. Dodds, G. D. Hager, and A. S. Morse. What visual tasks are performable without calibration? DCS RR-1138, Yale University, New Haven, CT, October 1997. 11. S. Hutchinson, G. D. Hager, and P. Corke. A tutorial introduction to visual servo control. IEEE Trans. Robot. Automat, 12(5), 1996. 12. M. Jagersa~d, O. Fuentes, and R. Nelson. Experimental evaluation of uncalibrated visual servoing for precision manipulation. In Proc., ICRA, pages 2874-2880, 1997. 13. L. Robert, C. Zeller, and O. Faugeras. Applications of non-metric vision to some visually guided robotics tasks. Technical Report 2584, INRIA, SophiaAntipolis, June 1995. 14. C. Samson, M. Le Borgne, and B. Espiau. Robot Control: The Task Function Approach. Clarendon Press, Oxford, England, 1992.
V i s u a l Tracking of P o i n t s as E s t i m a t i o n on t h e Unit Sphere Alessandro Chiuso and Giorgio Picci Dipartimento di Elettronica e Informatica Universit~ di Padova Padua, Italy S u m m a r y . In this paper we consider the problem of estimating the direction of points moving in space from noisy projections. This problem occurs in computer vision and has traditionally been treated by ad hoc statistical methods in the literature. We formulate it as a Bayesian estimation problem on the unit sphere and we discuss a natural probabilistic structure which makes this estimation problem tractable. Exact recursive solutions are given for sequential observations of a fixed target point, while for a moving object we provide optimal approximate solutions which are very simple and similar to the Kalman Filter recursions. We believe that the proposed method has a potential for generalization to more complicated situations. These include situations where the observed object is formed by a set of rigidly connected feature points of a scene in relative motion with respect to the observer or the case where we may want to track a moving straight line, a moving plane or points constrained on a plane, or, more generally, points belonging to some smooth curve or surface moving in R 3. These problems have a more complicate geometric structure which we plan to analyze in future work. Here, rather than the geometry, we shall concentrate on the fundamental statistical aspects of the problem.
1. I n t r o d u c t i o n The operation of perspective projection onto the image plane of an ideal (pinhole) c a m e r a can be described geometrically as the intersection of the rays ( straight homogeneous lines) emanating from the optical center of the camera, connecting to the observed object in ]I{3, with the image plane. In practice the projections are noisy and the detected feature points on the image plane do not correspond exactly to straight projections of the real feature points. This occurs because of distortion of the optical systems and noise of various kinds entering the signal detection and the processing of the electronic image formed on the CCD array. For these reasons, the task of reconstructing the location in 3-D o f . a n observed object from its noisy projections on the image plane, is a non trivial problem which should be treated by a p p r o p r i a t e statistical methods. So far only ad hoc estimation methods have been used (most of the times variations of the Extended K a l m a n filter) with generally poor performance and serious divergence problems. A sound statistical analysis of the problem has been lacking. Since we cannot measure distances along the projecting rays and we m a y at most recover the feature points modulo distance from the optical center,
Visual Tracking of Points as Estimation on the Unit Sphere
91
observing points moving in 3-space by a monocular vision system, is really like observing points in the projective space or, equivalently, points lying on the unit sphere in R3. This is well known. Also we may, without any loss of information, normalize the vector joining the optical center to the measured projections on the image plane to unit length. In fact, both the unaccessible measured points and the observed projections on the image plane, may be thought of as directions and represented, say, by the coordinates of the corresponding unit vectors x and Yk on a unit sphere. In this formulation, x is the true unknown direction pointing at the observed point in ~3 and Yk, k -- 1 , . . . , m (all vectors on the unit sphere) are noisy measurements of the direction x. The precise nature of the observation noise will be discussed later, however it should be clear that the way the noise affects the ideal perspective projection x cannot be additive and a realistic formulation of the problem must depart sharply from the standard linear-Gaussian setup. In the simplest case the observed feature objects are points moving unconstrained in ~3 and their directions are unit vectors which live naturally on the two-dimensional (unit) sphere S 2. Hence the problem is formulated as estimation on the unit sphere. Some other perspective estimation problems, for example recovering lines moving in ]~3 by observing their projections on the image plane, give rise to estimation on higher dimensional manifolds such as the Grassmannian. More general examples of estimation problems on manifolds which occur in computer vision are discussed in the work of Ghosh [5, 6] and Soatto et al [13, 14]. At this point we need to discuss probability distributions on the unit sphere.
2. T h e
Langevin
Distribution
A family" of probability distributions on the sphere which has many desirable properties is defined by the Langevin density p(x) - 4 r s i n h ~ exp a#'x,
IIx]l = 1
(2.1)
with respect to the spherical surface measure da = sin OdOdr The vector parameter ~ E S 2 ( /~ is conventionally normalized to unit length) is the mode of the distribution, while the nonnegative number a > 0 is called the concentration of the distribution. For ~ -~ 0 the density becomes the uniform distribution while for a -r cx), p tends to a Dirac distribution concentrated at x = ~u. The density function (2.1), denoted L ( # , a ) , was introduced by Langevin (1905) in his statistical-mechanical model of of magnetism [7]. Since then it has been rediscovered and used in statistics by a number of people, see [15]. Observe t h a t the Langevin distributions form a one-parameter exponential family and that the family is invariant with respect to rotations, in the sense that, if x is Langevin distributed, then for any R E SO(3) the
92
Alessandro Chiuso and Giorgio Picci
random vector y :-- R x still has a Langevin distribution with the same concentration parameter as x and mode parameter R#. An important p r o p e r t y of the Langevin distribution is the preservation of the functional form under multiplication L(#I, ~1) L(p2, a2) = L(#, a) (2.2) where L(/z, ~) is Langevin with parameters ~1 ]~1 "~" ~2 ~2,
#=
I[tcxpl+tr
tc=l[~1#1+tr
(2.3)
Introducing a coordinate system with unit vectors el, e2, e3 = # and spherical polar coordinates (0, r relative to this frame, we have x = sin 0 cos eel -4- sin 0 sin r
+ cos 0e3
whereby (2.1) can be rewritten in the form -
-
p(O, r - 47r sinh
exptccos0
0<0
Hence L(tt, ~) is rotationally symmetric around its mode #. The expression given in (2.1) is for a Langevin distribution on the unit sphere in ~3. For higher dimensions, the normalization constant has a slightly more complicate expression. The Langevin distribution on S '~-l, n > 3, is ~(n/2--1)
p(x) = (27c)n/2In/2_l(tr
expa#'x,
I1~11 =
1
(2.4)
where In~2_ 1 (X) is a modified Bessel function of the first kind. More generally, an arbitrary probability density functions on S n-1 can be expressed as the exponential of a finite expansion in spherical harmonics. These are discussed, for example, in [15, p. 80-88]. In this sense the Langevin density is a sort of "first order" approximation as only the first spherical harmonic, cos tT, is retained in the expansion and the others are assumed to be negligible. A more general approach than the one followed here could be to consider densities which are exponential of a finite sum of spherical harmonics. These are of exponential type, also have a set of finite dimensional sufficient statistics and could possibly be treated by generalizing what we do in this paper. We shall leave this to a future investigation. Also, most of the times in this paper we discuss three dimensional problems (n=3) only. The generalization to arbitrary n is really straightforward and can be left to the reader. Rotation-invariant distributions like the Langevin distribution are natural for describing random rotations. Let x be a fixed direction, represented as a point in $2, which is observed by a camera. The observation mechanism perturbs x in a r a n d o m way ( say because of lens distortion, pixel granularity etc). Since the o u t p u t of the sensor, y, is also a direction represented by a vector of unit length, the
Visual Tracking of Points as Estimation on the Unit Sphere
93
perturbation may always be seen as a random rotation described by a r a n d o m rotation matrix 1 R = R(p) E SO(3), where p is the polar vector of the rotation, i.e. R(p) := exp{pA} so t h a t y := R(p) x
(2.5)
In other words we can always model the noise affecting x as multiplication by a rotation matrix. The action of the "rotational observation noise" on directions x E S 2 can in turn be described probabilistically by the conditional density function p(y I x = x) of finding the observation directed about a point y on the sphere, given that the "true" observed direction was x -- x. A very reasonable unimodal conditional distribution, rotationally symmetric around the starting direction x (no angular bias introduced by the observing device) is the Langevin-type density,
P(Y I x) - 4~ sinh
exp ~x'y
(2.6)
In this framework we may think of the ordinary distribution L(#, a) as a conditional density evaluated at a known point x -- #. Note that, since #ly is just the cosine of the angle between the unit vectors # and y on the sphere, the values of the conditional probability distribution p(y Ix) are invariant with respect to the action of the rotation group SO(3) on S 2, i.e. with respect to coordinate change on the sphere.
The Angular Gaussian Distribution The main reason to work with the Langevin class of distribution functions on the sphere is that its properties are the natural analog of the properties of Gaussian distributions on an Euclidean space. There are various a t t e m p t s in the literature to derive the Langevin distribution as the distribution function of some natural transformation of a Gaussian vector. Perhaps the easiest result in this direction is the observation, first made by Fisher [4], t h a t the distribution of a normal random vector x with isotropic distribution Af(#, a2I), conditional on the event {[[x[I -- 1 } is Langevin with mode #/[[~u[I and concentration parameter [[#[[/a 2. A more useful result, discussed in [15, Appendix C] is the remarkable similarity of the so-called Angular Gaussian distribution to the Langevin. The angular Gaussian is the probability density of the direction vector x := ~/1[~[[ when ~ has an isotropic Gaussian distribution, i.e. ~ ~ H ( # , a2I). The distribution is obtained by computing the marginal of Af(#, a2I) on the unit sphere [Ix[[ = 1. It is shown in [15, Appendix C] that the angular Gaussian is a convex combination of Langevin densities with varying concentration parameter s, 1 The wedge A denotes cross product.
94
Alessandro Chiuso and Giorgio Picci
Ag(x) = N l
dO
s n - l e - 2 1-.= ~-~eSX, xds'
A=
u
~-~
a = Ilull
and it is seen from this formula that Ag depends on # , a 2 only through the two parameters A and a. We shall denote it by Ag(A, a2). T h e notation is convenient, since for either moderate or large values 2 of a, Ag(A, a 2) is, to all practical purposes, the same thing as L(A, ~), where
Ilull 2 =
Ib'll
:=
(2.7)
o2
Note that all distributions A/'(p#, p2o2I), p > 0, give origin to the same angular Gaussian as A/'(#, o2I). (This precisely is the family of isotropic Gaussians generating the same angular distribution.) The role of the angular Gaussian in modeling directional observations can be illustrated by the following example. Let ~, ( be independent Gaussian isotropic random vectors with ~ ,~ Af(#, a2I), ~ ,~ .Af(O, a2I) and assume we observe the direction of the vector
= C~ + fi ... A/'(#, o 2 C C ' + a2I).
(2.8)
If C is an orthogonal matrix, CC' = I and the distribution of ~ is isotropic Gaussian. Denoting y := ~/[[y[[ we have y ~-. L(/~/[[#[[, ~ ) . It is easy to see that the conditional density p(y [ [) is also angular Gaussian. In fact this follows since the conditional distribution of y given 2 Hence = ( is Gaussian with mean C [ and variance o z.
p(y [ ~ = ~) = Ag(C[/[[~[[, [[C~H2/oz 2) -- A g ( C x , [[~[[2/O2)
(2.9)
where x is the direction vector of ~. We are interested in the conditional density p(y [ x). We shall state the result in a formal way as follows. 2 of y given ~ = [ is proporP r o p o s i t i o n 2.1. If the conditional variance Oz, tional to [[~[[2, i.e. a 2 = O02[[~H2, then the conditional density p(y [ x) for the model (2.8) is angular Gaussian. P r o o f 2.1. Denote r := [[~[[. Then the claim follows from
p(y Ix) =
/5
r p(y I x, r)p(r I x ) a t
since p(y, r I x) = p(y I x, r ) p ( r ] x) = p(y I ( ) p ( r ] x) and in the stated assumption, p(y I x, r) does not depend on r. 2 "Moderate or large" here means that ,~ :--- a 2 should be greater than, say, 100 in order to have a fit within a few percent of the values of the two functions. In fact the angular Gaussian approximates a Langevin distribution also for a small, when both of them are close to uniform, but the relation between a and is different.
Visual Tracking of Points as Estimation on the Unit Sphere
95
Best Approximation by a Langevin distribution Let P(x) be an arbitrary probability measure on the unit sphere, absolutely continuous with respect to the surface measure da = sinO dO d~; we want to approximate the density p(x) -- dP/da by means of a density of the Langevin type, i.e. by a density in the class s
= {/(x) - 4~rsinh(~)exp{~#Tx}
'
'~ -> 0, I]#1] = 1};
(2.10)
using as a criterion of fit the Kullback's pseudo-distance, defined by
g(p,g(t,,,~))= Epln
p(x)
_ L2p(x)ln
p(x )
dax
(2.11)
The problem is to find the minimum: min K(p, g0)
#EO
(2.12)
where
o = {0 = (~, ~) : ~ _> 0, I1,11 -- 1} Introducing Lagrange multipliers
Ap(O) = K(p,Q) + A~(0) where
(2.13)
3
i=l
and taking derivatives with respect to # and a it can be shown that the minimum is attained for: { cosh~
1
#'rex = 0
(2.14)
am~ -- A# = 0 where mx is the mean vector of P mx
= fs2 xp(x) dax.
(2.15)
More explicitly, the optimal # and tr are given by:
{cosh~
1 _]]mxl[
sinh r~rnx ~
(2.16)
-IIm~ll Note that for a Langevin density, the parameters (#, ~) are completely determined by the mean vector m (i.e. there is a one-one correspondence between m and (#, a)), as it is easily checked that
96
Alessandro Chiuso and Giorgio Picci { cosh~
1 _ [[m][ (2.17) _
Ijmll
Hence our approximation problem is solved simply by equating the mean vectors of the two distributions. In other words the only thing we need to know to find the best Langevin approximant of P is its mean vector. This result leads to a kind of wide-sense estimation theory on spheres with the mean parameter playing the same role of the second order statistics in the Gaussian case. Note that here both the mode (i.e. the "most probable direction") and the concentration parameter (telling us how d a t a are scattered about the mode) are deducible from the mean. Obviously one expects reasonable results from this wide-sense theory only when the density to be approximated is unimodal and approximately symmetric around the mode.
3. M A P
Estimation
This section is taken literally from [12] in order to make the paper selfcontained. Assuming that the a priori model for x is of the Langevin type say, x =... L(xo, no) and assuming independence of x and p, we can form the a posteriori distribution p ( x l y ) by Bayes rule. The joint density is p ( x , y) = p ( y ] x ) p ( x ) --= A(a, n0) exp g/]'x where /~
no
A(n, no) = 47r sinh n 47r sinh n0 f~'x := ~y'x + ~oX'oX. Note that ~ = k(y, xo), f~ = p(y, Xo) are functions of y and of the a priori mode x0 uniquely defined by the condition ][/~H = 1. These functions are explicitly given by /~ . - ny + n0x0
k := Hay + r~ox0]].
Dividing by the marginal p(y) =
~s
p(x, y) dax = A(a, ~o)
47r sinh k :
one obtains the a posteriori density k p ( x l Y) = 4~r sinh k exp k(y)/2'(y)x
(3.1)
Visual Tracking of Points as Estimation on the Unit Sphere
97
which is still Langevin. The conditional mode vector t~(y) and the conditional concentration k(y) are still given by formula (3.1). The Bayesian Maximum a Posteriori estimate of x, given the observation y is trivial to compute in this case. These formulas can, in certain cases, be generalized to the dynamic case and lead to nonlinear versions of the Kalman filter. We shall consider the simplest situation below. S e q u e n t i a l O b s e r v a t i o n s o f a F i x e d Target Assume we have a sequence of observations y(t) := R ( p ( t ) ) x = e x p { p ( t ) A } x
t = 1,2,...
(3.2)
where the p's are identically distributed independent random rotations which are also independent of the random vector x. The y(t)'s are conditionally independent given x, and p(y(t) Ix) = L(x, ~), where ~ is the concentration parameter of the angular noise. Hence, denoting yt := [ y ( 1 ) , . . . , y ( t ) ] ~ we may write /~t
p(y*l x) --
t
(47r sinh a)t exp a(x, E
y(s))
(3.3)
s=l
where (., .) denotes inner product i n II~3 . Assuming an a priori density of the same class, x ,,, L(xo, n0), one readily obtains the a posteriori measure
k ( t ) .expk(t)(ft(t) x) k(t))
p(x ] yt) -
(4~ sinh
(3.4)
which is still of the Langevin class with parameters t
1
~(t) = ~--~(~ Z y(~) + ~oxo)
(3.5)
t
~(t) = I 1 ~
y(s) + ~oxoll
(3.6)
These formulas can be easily updated for adjunction of the t + 1-st measurement. At time t + 1 one obtains, t
1 ~(t + 11 = 1------~ k(t + ( a E y ( s )
+ aoxo +
ay(t + i l l
8----1 t
~(t + 1) = I 1 ~
y(s) + ~ox0 + ~y(t + 1)1]
8----1
= II~(tlP(t) + ~y(t + i)ll
98
Alessandro Chiuso and Giorgio Picci
which look like a nonlinear version of the usual "Kalman-Filter" updates for the sample mean which one would obtain in the Gaussian case. P r o p o s i t i o n 3.1. The MAP estimate (conditional mode) fi(t), of the fixed random direction x observed corrupted by independent angular noise {p(t)} of concentration n, propagates in time according to 1 /~(t + 1) - k(t + 1-------~ (k(t)ft(t) + ~y(t + 1))
(3.7)
k(t + 1) = Ilk(t)ft(t) + ~y(t + 1)H
(3.8)
with initial conditions fi(0) = x0 and k(0) = no.
4. Dynamic estimation The next task is to generalize the recursive MAP estimation to the case of a moving target point. Dynamic Bayes formulas Consider a stationary Markov process on the sphere x(t) and denote by
p(xt ly t) the a posteriori density given the observations yr. A standard application of Bayes rule see e.g. [16, p.174] provides the following formulas 1
p(xt+l ly t+: ) = ~ P(yt+l Ixt+:) p(xt+: ly t) p(xt+l lYt) =
Jsf2
p(Xt+l IXt) p(zt ly t) dax,
(4.1) (4.2)
where N is a normalization constant. Note that if both the observation noise model and the a priori conditional density p(xt+: ly t) are Langevin-like, so is the a posteriori density
p(x~+l lyTM).
In this ideal situation the evolution of the conditional mode ft(tlt ) of p(xtly t) is described by formulas analogous to (3.7), (3.8), i.e.
p(t + lit +
i)=
k(t +
1
lit
+
l)(k(t+
l l O # ( t + lit ) +
+~y(t + 1)) ~;(t + lit + 1) = II,~(t + llt)p,(t + lit ) + ,u(t + 1)11
(4.3) (4.4)
Moreover, if in turn the Chapman-Kolmogorov transition operator in (4.2) happens to map Langevin distributions into Langevin distributions, then, (assuming a Langevin initial distribution for x(0)), the estimation iteration preserves the Langevin structure and an exact finite-dimensional filter results,
Visual Tracking of Points as Estimation on the Unit Sphere
99
described completely in terms of conditional mode and conditional concentration only. In this fortunate situation (4.2) provides an updating relation for the a priori mode of the form t~(t + lit) = F(k(tlt ) f~(tlt) )
(4.5)
k(t + lit ) = g(k(tlt)f~(tlt))
(4.6)
where F and g are in principle computable from the Markovian model. The two recursions are started with initial conditions fi(ll0 ) --- x0 and k(ll0 ) = n0 coming from the prior distribution for x(0). In reality, nontrivial examples where the Langevin distribution is preserved exactly in the prediction step, are hard to find. There are however extensive comparison studies, reported e.g. in [15] showing that some classical models (say Brownian motion on the sphere) tend in certain cases to preserve the Langevin structure, at least approximately. For these examples the formulas above can be used as a sort of "exponential" approximate filter (see [1] for a precise formulation of this concept).
Approximate Angular Gaussian filter To discuss a reasonable class of Markov models which approximately preserve the Langevin structure, we consider the popular linear Gauss-Markov state space model describing, say, a point moving randomly in 11(3 ~(t + 1) -- A~(t) + Bw(t)
(4.7)
where w is white Gaussian noise, and consider the dynamics of the direction vector x(t) :-- ~(t)/ll~(t)ll on the unit sphere. Note that x(t) is generally not Markov (for example, for B = 0, x(t) is Markov if and only if A is a multiple of an orthogonal matrix) so here one gives up the idea of modeling directly the dynamics on the sphere (see however the construction involving a random time change carried out for the projected Wiener process by 0ksendal [10]). The basic idea of this algorithm is to exploit the practical coincidence of Angular Gaussian and Langevin distributions described in the previous section. Let p(t) :-- II~(t)ll. We assume that the state model (4.7) has a known (conditional) noise to signal ratio a2u := aw (t) 2/p(t) 2, see Proposition 2.1. We shall also assume that both matrices A and B in the model are orthogonal matrices (or a scalar multiple of an orthogonal matrix) so t h a t the state covariance P(t) := E~(t)~(t) ~, solution of the Lyapunov equation
P(t + 1) = A P ( t ) X + p(t)2a~BB ' admits isotropic variance solutions P(t) = ax (t)2I, if the model is initialized with an initial condition of the same type. A key step of the algorithm is the "lifting" from an angular Gaussian distribution on the sphere S n-1 to an isotropic Ganssian on IRn
100
Alessandro Chiuso and Giorgio Picci
L(A, n) ~_ Ag(A, n) -+ H ( # , a 2) As seen in the previous section this lifting involves an arbitrary scale parameter p := I1#11. The steps of the algorithm are the following 1. Assume p(ytlx(t) = xt) " L(xt, no) and let p(xtly t-l) ~ L(fitlt_l, ktlt-1) 2. (Measurement update) when the measurement Yt becomes available one has p(xt[y t) ~ L(ftt[t, &t[t) where
1 ~tt]t = ~ ( ~ t [ t - l ~ t t [ t - 1
+ noyt)
kt[ t = [[ktlt_lfttlt_ 1 -]- noy t [[.
(4.8)
3. Think of L(ftt[t, ~tlt) as a conditional angular Gaussian distribution with the given parameters. 4. (Lifting to ~n.) Set ~(t) := p(t)x(t), a Gaussain random vector whose time evolution is described by the Gauss-Markov model (4.7) 5. The error covariance of the conditional mean (predictor) ~(t + 1 [ t), i.e. the conditional covariance of ~(t + 1) given yt, satisfies the well-known Kalman filtering update
P(t + 1 [ t) = A P ( t I t)A' + p(t)2a2BB ' From this, both P(t ] t) and P(t + 1 I t) are isotropic
P(t + 1 [ t) = p(t + 1)2a2(t + 1 [ t)I,
P(t ] t) = p(t)2a2(t [ t)I
and, by orthogonality of A, p(t + 1) = p(t), so the covariance update is equivalent to the following simple relation for the normalized scalar covariances 2 a~(t+l[t)=a~(t[t)+c%. 6. Project back on the unit sphere. The conditional angular Gaussian distribution of x(t + 1) given y t Ag(ftt+l]t ' kt+llt) is assimilated to a Langevin distribution, obtaining p(xt+l ]yt), i.e. ~tt+ll t = Af-ttl t ~t+l[t
=
aug'tit nu "~- ~t[t"
(4.9)
7. Repeat the first step when yt+l is available to compute p(xt+l [yt+l), etc.
Visual Tracking of Points as Estimation on the Unit Sphere Estimation
101
of a simple diffusion process evolving on a sphere
In this section we shall give an example of application of the "wide-sense" approach to estimation on spheres alluded at in section 2. We shall need to recall some basic facts a b o u t diffusion processes on a sphere. Let b(t) = [ b l ( t ) b u ( t ) ... ,bp(t)] t be s t a n d a r d Brownian motion in l~p. n r o c k e t t [2], has shown t h a t a diffusion process described by the stochastic differential equation dx(t) = f ( x ) dt + G ( x ) d b ( t ) evolves on the sphere
~n--1
if and only if the following conditions are satisfied
x'f(x) q- f'(x)x + trace{G(x)'G(x)) = 0 x'Gi(x) + G~(x)x = 0 Vx e ~n i = 1,...p.
(4.10) (4.11)
where G(x) := [Gl(x)G2(x)... Gp(x)]. Hence writing f(x) := A(x)x and Gi(x) := Bi(x)xi = 1 ... ,p where Bi(x) are square n x n, we see from the second equation t h a t the matrices Bi(x) m u s t be skew symmetric (Bi(x) t = - B i ( x ) ) and P
A(x) + A(x)' + E
B,(x)'Bi(x) = O.
i:l
The simplest situation occurs for A(x) and Bi(x) constant. Naturally the Bi's must be constant skew s y m m e t r i c matrices and A the sum of a skew symmetric m a t r i x plus a It6 "correction term", i.e. must look like
1
~_~ B~B, = $-2i=l
1 2
A
where ~ ' = -/'2 and 1 / 2 A has the expression 1
1
=
E~ai,jE j (~,J)
coming from the expansion of the Bi's in terms of the standard basis of (skew symmetric) elementary rotation matrices {Ei}. Under these conditions the diffusion equation P
dx(t) = Ax(t) dt + E
B~x db(t)
IIx(0) II = 1,
(4.12)
i=1
defines a homogeneous Markov process with values in S n - l , i.e. [[x(t)l I = 1 for all t _> 0. This simple "linear" model actually describes a r a t h e r special Markov process on the sphere. In fact, it turns out t h a t the stochastic differential equation (4.12) describes a Brownian motion (evolving on the sphere)
102
Alessandro Chiuso and Giorgio Picci
Brownian motion on spheres can be defined axiomatically as the natural analog of the process in ll~n and is discussed by several authors. The classical references are Perrin [11], McKean [9] and Brockett [2]. That the stochastic differential equation (4.12) represents a rotational Brownian motion on the sphere can be seen by rewriting it a little more explicitly as: dx(t) = [~dt + Ldb(t)] A x(t) - l A x ( t ) dt where ~A := $2. The term between square brackets is an infinitesimal random angular velocity vector dw(t), so that, dx(t) = dw(t) A x(t) + (It6 correction).
(4.13)
Now, assume t h a t the observation process is described by the same conditional law of the Langevin type P(Yt I xt) where ~o exp (aoX~yt) L(xt, a0) - 47r sinhao
(4.14)
as discussed earlier in this paper. Assume that the last available measurement
was y(to) and t h a t the a posteriori conditional distribution, p( Xto ly t~) "" L(ft( to It0), k(t0 Ito)) is available at time to. In this section, following the theory presented in section 2, we shall compute the best Langevin approximant of the a priori conditional density before the next measurement p(xt ly t~ t > to, in the sense of minimal Kullback distance. To this end, we don't need to solve the Fokker-Planck equation to obtain p(xt I yto) and then approximate it via minimization of the Kullback distance; we just need to recall t h a t the best Langevin approximant of p(xt I yto) is uniquely defined by the conditional mean m= (t I to) = E(x(t) I yto) according to the formulas (2.16). The conditional mean is immediately computed from the integral equation equivalent to (4.12)
If
IT.2 9 = 1,3
x(t) = eA(t-t~
+
m (t I to)
-
= exp{($2
eA(t-")a(x(s))db(s)
1A)(t
--
to)}m (to I to).
a2Jij (isotropic diffusion on the sphere) we obtain
1 A = a2I 2
and (4.15) becomes:
(4.15)
Visual Tracking of Points as Estimation on the Unit Sphere
m~(t I to) = e-~2(t-t~
- to)}m~(to I to).
103 (4.16)
which shows how the conditional mean tends to zero as t ~ c~, a natural phenomenon for diffusion processes. The parameters/~(t t to) and k(t I to) of the conditional Langevin distribution L(f~(t ] to), k(t I to)) approximating p(xt I yto) are obtained from
#(t I to) =
Hmz(t l to)ll =
mx(tlto)/llm.(tlto)ll cosh k(t } to) sinh k ( t l t o )
1 ~(t ] to)
Note that in order to get ~(t I to) we need to solve a trascendental equation. One may take advantage of the fact that for large ~(t I to) the second equation can be approximated by : 1
]]m=(t l to)H = 1
~(t I to)
to write an approximated explicit formula for
;~(tlto) (and
the exact equation
for ~(t I to)) 1
k(tlto) = 1 -[Im,(tlto)lJ f~(tlto)-
mx(t l to)
Ilmx(t l to)ll
(4.17)
(4.18)
Using this approximation and substituting (4.15) in the expressions above we finally obtain
~(t0 [to)
k(t I to) = ~(t0 { to)(1 - e-~2(~-~o)) + e -'rz(t-t~
f~(t I to) = exp{O(t - t0)}/~(t0 I to)
(4.19) (4.20)
In this way we obtain an approximate version of the conditional density
k(t l to)
p(xtly t~ ~- 47rsinhk(t l to) eXp(A(t l to)f~(t l to)'xt)
(4.21)
valid for an isotropic diffusion on the sphere and conditional concentration parameter larger than a few units. In general the Langevin approximation is fairly good for ~ greater than 2 or 3, see [15]. This concludes the discussion of the prediction step. The measurement update equations for adjunction of the next measurement Yt can be written in the usual format:
&(t I t) = II;~(t I to)f~(t ] to) + ~oytll 1
f,(t ] t) - k(t ] t) (~(t ] to)fJ(t ] to) + noYt)
(4.22) (4.23)
104
Alessandro Chiuso and Giorgio Picci
x 103 S
a) I
I
I
I
I
1460
1470
1480
1490
1500 time
6
4
2 0 1450
1450
1520
1530
1540
1550
1510
1520
1530
1540
1550
b)
x 10"3 8
1510
~
1460
1470
1480
1490
1500 time
Fig. 5.1. Plots of 1 - cos(O), a) solid: observed; dotted: estimated with .2, = I; b) solid :observed, dotted : estimated with A = A(t).
5. S i m u l a t i o n s A reference trajectory is generated according to the linear model (4.7) where the matrix A if a function of t : A(t) = A1 for t _< 1500 and A(t) = A2 for t > 1500; the noise enters the measurement process as described by (2.5) with concentration parameter no -- 500. Results are presented for the approximate angular Gaussian filter based on the true model or on a model with .zl _- I. Plots of 1 - cos(0) are shown in fig. 5.1.
6. C o n c l u s i o n s In this paper we have discussed a simple Bayesian estimation problem on spheres related to a prototype directional reconstruction problem in computer vision. For a fixed direction in space, a simple closed-form recursive M A P estimator is derived. For a general Markovian target approximate filters can be constructed.
Visual Tracking of Points as Estimation on the Unit Sphere
105
Acknowledgement. Research supported by grants ASI tLS-103 and RS-106 from the Italian Space Agency.
References 1. D. Brigo, Filtering by Projection on the Manifold of Exponential Densities, Ph. D. thesis, Department of Mathematics, Vrije Universiteit, Amsterdam, 1996. 2. R. W. Brockett, Lie Algebras and Lie Groups in Control Theory, in Geometric Methods in Control, R. W. Brockett and D. Mayne eds. Reidel, Dordrecht, 1973. 3. R. W. Brockett, Notes on Stochastic Processes on Manifolds, in Control Theory in the glst Century, C.I Byrnes, C. Martin, B. Datta eds. Birkhauser, 1997. 4. R. A. Fisher, Dispersion on a sphere, Proc. Royal Soc. London, A 217, p. 295-305, 1953. 5. B. Ghosh, M. Jankovic, and Y. Wu. Perspective problems in systems theory and its application in machine vision. Journal of Math. Systems, Est. and Control, 1994. 6. B. K. Ghosh, E. P. Loucks, and M. 3ankovic. An introduction to perspective observability and recursive identification problems in machine vision. In Proc. of the 33rd Conf. on Decision and Control, volume 4, pages 3229-3234, 1994. 7. P. Langevin, Magnetisme et theorie des electrons, Ann. de C h i m e t de Phys., 5, p. 70-127, 1905. 8. J. Lo and A. Willsky, Estimation for rotational processes with one degree of freedom, parts I, II, III, IEEB ~ransactions on Automatic Control, AC-20, pp. 10-33, 1975. 9. H. P. McKean, Brownian Motion on the Three-Dimensional Rotation Group, Mere. Coll. Sci. University of Kyoto, Series A, X X X I I I , N. 1, pp. 25-38, 1960. 10. Oksendai, Stochastic Differential Equations, Springer, 1990. 11. F. Perrin, l~tude Mathdmatique du Mouvement Brownien de Rotation, Ann. Ecole Norraale Superieure, (3), XLV: 1-51, 1928. 12. G. Picci, Dynamic Vision and Estimation on Spheres, in Proceedings of the 36th Conf. on Decision and Control, p. 1140-1145, IEEE Press, 1997. 13. S. Soatto, R. Frezza, and P. Perona. Motion estimation via dynamic vision. IBEE Trans. on Automatic Control, 41,393-413, 1996. 14. S. Soatto. A Geometric Framework for Dynamic Vision. Dr. Sc. Thesis, California Institute of Technology, 1996. 15. G. S. Watson, Statistics on Spheres, Wiley, N.Y 1983. 16. A. H. Jazwinski Stochastic processes and Filtering Theory Academic Press, New York, 1970.
Extending Visual Servoing Techniques to Nonholonomic Mobile Robots Dimitris P. Tsakiris, Patrick Rives, and Claude Samson INRIA Sophia-Antipolis 2004, Route des Lucioles, B.P. 93 06902, Sophia Antipolis Cedex - France
S u m m a r y . The stabilization to a desired pose of a nonholonomic mobile robot, based on visual data from a hand-eye system mounted on it, is considered. Instances of this problem occur in practice during docking or parallel parking maneuvers of such vehicles. In this paper, we investigate the use of visual servoing techniques for their control. After briefly presenting the relevant visual servoing framework, we point out some problems encountered when it is considered for nonholonomic mobile robots. In particular, simple velocity control schemes using visual data as feedback cannot be applied anymore. We show how, by using the extra degrees of freedom provided by the hand-eye system, we can design controllers capable of accomplishing the desired task. A first approach, allows to perform a visual servoing task defined in the camera frame without explicitly controlling the pose of the nonholonomic mobile basis. A second approach based on continuous time-varying state feedback techniques allows to stabilize both the pose of the nonholonomic vehicle and that of the camera. The experimental evaluation of the proposed techniques uses a mobile manipulator prototype developed in our laboratory and dedicated multiprocessor real-time image processing and control systems.
1. I n t r o d u c t i o n In order to perform a task with a mobile robot, one needs to efficiently solve m a n y interesting problems from task planning to control law synthesis. At the control level, i m p o r t a n t results have been established for nonholonomic systems, like wheeled mobile robots, which lead to specific control problems: not only the linearization of these systems is uncontrollable, so t h a t linear analysis and design methods cannot be applied, but also there do not exist continuous feedback control laws, involving only the state, capable of stabilizing such a system to an equilibrium, due to a topological obstruction pointed out by Brockett [1]. One of the approaches developed to solve the stabilization problem is the use of t i m e - v a r y i n g s t a t e feedback, i.e. control laws t h a t depend explicitly, not only on the state, but also on time, usually in a periodic way, which Samson [13] introduced in the context of the unicycle's point stabilization. This sparked a significant research effort( see for example [2] for a comprehensive survey), which d e m o n s t r a t e d the existence of efficient such feedback control laws and provided some design procedures. These results can be very useful in sensor-based control of mobile robotic systems. One of the prominent methods in this area is visual servoing, which
Extending Visual Servoing Techniques to Nonholonomic Mobile Robots
107
was originally developed for manipulator arms with vision sensors mounted at their end-effector [4], [3], [5], [6]. In this paper, we point out the difficulties of transferring directly these techniques to nonholonomic mobile robots. We show, however, that by properly adding degrees-of-freedom to the nonholonomic platform, in the form of a hand-eye system, and by taking advantage of the time-varying stabilizing control schemes, it is still possible to extend visual servoing techniques to nonholonomic systems [9], [14] For simplicity, we only consider here the planar case, where a mobile robot of the unicycle type carries an n-d.o.f, planar manipulator arm with a camera that moves parallel to the plane supporting the mobile robot. In a similar way, we only consider the kinematics model of the mobile robot which is sufficient to handle the problems due to the nonholonomic constraints. In section 2, we model the kinematics and vision system of a nonholonomic mobile manipulator with an n-degree-of-freedom planar arm. Section 3 is dedicated to the analysis and synthesis of various visual servoing control schemes for our system. Some related experimental results are also presented.
2. Modeling 2.1 Mobile Manipulator Kinematics We consider a mobile robot of the unicycle type carrying an n-d.o.f, planar manipulator arm with a camera mounted on its end effector (figure 2.1 shows the case of n = 3).
Fig. 2.1. Mobile Manipulator with Camera
108
Dimitris P. Tsakiris, Patrick Rives, and Claude Samson
Consider an inertial coordinate system {Fo} centered at a point O of the plane, a moving coordinate system {FM} attached to the middle M of the robot's wheel axis and another moving one {Fc} attached to the optical center C of the camera. Let (x, y) be the position of the point M and 0 be the orientation of the mobile robot with respect to the coordinate system {Fo}; let lr~ be the distance of the point M from the first joint B1 of the n-d.o.f, planar arm, with ll, ..., In being the lengths of the links of the arm and r ... , Ca being its joint coordinates. Let (XMC,YMC, OMC) represent the configuration of {Fc} with respect to {FM}, (XCT, YCT,OCT) represent the configuration of {FT} with respect to {Fc}, (xc, Yc, Oc) represent the configuration of {Fc} with respect to {Fo} and (XT, YT, OT) represent the configuration of {FT} with respect to {Fo}, where XT = d is the distance of point T from point O and YT = O T = O. From the kinematic chain of figure 2.1 we have for the case of an n degree-of-freedom manipulator arm: n
n
OMC=Er
i
n
XMC=lm+Elicos(ECJ)'
i=l
i=l
i
YMc=Elisin(ECJ)"
j=l
i=1
j=l
(2.1) n
n
oc =o + F , r
and
x c = x + Im c o s 0 +
i=1
i
l, cos (0 +
cj),
i=1
j=l
n
i
YC = Y +/rosin0 + E l , sin (0 + E C J ) " i----1
j=l
(2.2) Also
OCT :
OT -- OC, X C T = - - ( X c -- X T ) C O S O c YeT = (xc
--
(YC -
YT)
sin0c
,
-- X T ) s i n O c -- ( Y C -- Y T ) COS OC 9
(2.3) and
O:OT --OMC -- OCT , X = X T -- X C T C O S ( 0 T - - OCT ) "4- Y C T s i n ( 0 T Y = Y T -- X C T
sin(0T -- OCT) -
YeT COS(0T
- - OCT) -- X M C COS 0 -4- Y M C s i n - - OCT) -- X M C s i n
0
-
0,
Y M C COS 0 .
(2.4) Equations 2.1 and 2.3 are useful in simulating our system, while 2.2 and 2.4 are useful in reconstructing its state. Velocity K i n e m a t i c s : . By differentiating the chain kinematics of the mobile manipulator and its environment, assuming that we consider stationary targets and solving for the spatial velocity of the target frame with respect to the camera frame ~C T, we get
Extending Visual Servoing Techniques to Nonholonomic Mobile Robots --~CT de.f__
--
~CT XCT
109
-~"
where X ~ f (x, y, 0) r is the state of the robot, while q ~ f (r r -.., Cn) r is the configuration of the manipulator arm and where the matrix B1,1 is
(
sinOc o
sin~ )
- cosOc 0 -1
,
with ac given by Eq. 2.2, b~'1 def --Ira sin ( ~-'~i=1 ~,]
(2.6) /.-,i=1 , sin ( ~'~j=i+l CJ),
~y2l , l d=e f _ t ~ cos ( E , =" 1 r . - 1 l, cos ( ~j=i+ln Cj) - 1,~ and the 3 • n ma- ~--~i=1 trix B1,2, which is the Jacobian of the manipulator arm, given by _ - ~-~i=1
-
' sm ( ~-~j=i+l
CJ)
E,\-~I z, cos ( E ~ = , + , r
9 0
+ i.
.
(2.7)
-1 N o n h o l o n o m i e C o n s t r a i n t s : . The nonholonomic constraints on the motion of the mobile robot arise from the rolling-without-slipping of the mobile platform's wheels on the plane supporting the system. Due to these constraints, the instantaneous velocity lateral to the heading direction of the mobile platform has to be zero. From this we get the usual unicycle kinematic model for the mobile robot: =vcosS, def
9=vsinO,
t~=w,
(2.8)
.
where v = x cos 0 + y sin t9 is the heading speed and w is the angular velocity of the unicycle. Then
/ cos 0 0
X=Ba,I(X)(:)=
~Slo001)
(:)
.
(2.9)
2.2 V i s i o n M o d e l We consider a special case of the general visual servoing framework developed in [4], [3] and surveyed in [5], [6], [12], as it applies to a hand-eye system composed of a manipulator arm with a camera mounted on its end-effector. Consider a fixed target containing three easily identifiable feature points arranged in the configuration of figure 2.1. The coordinates of the three feature points with respect to {FT} are (zp{T}, y{pT}), P 6 {l, m, r}. The distances a and b (fig. 2.1) are assumed to be known. The coordinates of the feature points with respect to the camera coordinate frame {Fc} can be easily found.
110
Dimitris P. Tsakiris, Patrick Rives, and Claude Samson
We assume the usual pinhole camera model for our vision sensor, with perspective projection of the target's feature points (viewed as points on the plane ]R 2) on a 1-dimensional image plane (analogous to a linear CCD array). This defines the projection function P of a point o f / R 2, which has coordinates (x, y) with respect to the camera coordinate frame {/Pc}, as
p: ~+ • ~
~ ~ : (x,y),
~ p ( x , y ) = yY.
(2.10)
X
where f is the focal length of the camera. In our setup, the coordinate x corresponds to "depth". Let the projections of the target feature points on the image plane be Yp = y(xp~t{C},yp{C}~), p E {l,m,r}, given by 2.10. The vision data are then Yv d_ef(Yl, Ym, Yr) x. Differentiating 2.10, we get the well-known equations of the optical flow [7] for the 1-dimensional case:
L = B2,1(yp, x~C)) eCT=
--
Y~ ~ I
~(I2 + Y~) |
e Cr,
(2.11) where the matrix B2,1 (Yp,xp{C}) corresponds to the Jacobian of the visual data, so-called interaction matrix [4], [10].
2.3 Visual Servoing Scheme The above modeling equations of the mobile robot with the n-d.o.f, manipulator arm can be regrouped to derive a few basic relations. The state of the system is X -- (X, q)X. Then
The sensory data are Y = (Yv, q)T. Then
The relationship between the state and the sensory data Y -- ~ ( X ) is given by equations 2.10, 2.11, 2.5 and 2.12. The corresponding differential relationship is = -~(X) X = B2(X)BI(X) X. (2.14) The controls of the system are U = (v, w, wr , . . . , w r Then -~ ----
( B3,1 03•
---- B 3 ( X ) U ---- ~k0nx2
~nxn
n- = (v, 9, q)X.
(2.1~)
Extending Visual Servoing Techniques to Nonholonomic Mobile Robots
111
3. V i s i o n - b a s e d C o n t r o l of Mobile M a n i p u l a t o r s 3.1 C a m e r a P o s e S t a b i l i z a t i o n :
In this first approach, we show that it is possible to use a velocity control scheme as done in the holonomic case, provided that the control objective does not require to explicitly stabilize the pose of the mobile platform. To illustrate this possibility, we consider a reduced system with only one actuated pan-axis (n = 1). Our objective is to stabilize the camera to a desired pose, obtained by specifying the corresponding vision data II* def (y/,, y * , y , ) T . We select the task output e ( X ) = Yv - Y*, with dim e = 3, which we want to drive exponentially to zero. The system state is X = (x,y,~, r T, the measurement data are Y = (Yl, Ym Yr, r T and the control is U = (v,w,wr T. From 2.11 we get = Yv = B2,i(X) ~ C T = _A e .
(3.1)
Then, away from singularities of B2,1, we have _~CT = - A B~,~ (X)(r~ - ]I*).
(3.2)
From the system kinematics we have \O/-cosr ~ C T = (BI,I(X) B1,2(X))B3(X)ld=I s m r
sin r -[/mCOSr -1
(3.3) where the 3 • 4 matrix (B1,1 B1,2) is given by 2.6 and 2.7, and the 4 • 3 matrix B3 is given by 2.15, by setting n = 1. The product (B1,1 B1,2)B3 depends only on r It is a nonsingular matrix, since its determinant is - l m . Then bl =
[/
BI,I(X) B1,2(X))B3(X
~CT
(3.4)
and, using 3.2, we finally get Lt = -)~ B I , I ( X ) B 1 , 2 ( X ) ) B 3 ( X )
B~,,~(X*)(Y~ - ]I*).
(3.5)
Subjected to" this control law, the mobile manipulator moves so that the camera asymptotically reaches its desired pose with respect to the target. However, the pose of the mobile platform itself is not stabilized, and problems of drift due to the non-stable zero-dynamics of the system can occur. In practice, however, friction will have a stabilizing effect and the platform come to a rest. However, the final pose reached by it will still largely depend upon its initial position and orientation. Related experimental results obtained in [9] are shown in fig. 3.1, where the trajectories of the system for two different initial configurations, but with the same desired camera pose with respect to the target, are plotted. The different final poses of the mobile platform can be seen.
112
Dimitris P. Tsakiris, Patrick Rives, and Claude Samson
Pos~on
Final
J
~
F~
Po.~
/ ......... TAROh-'r
(a) (b) Fig. 3.1. Robot trajectories for two different initial configurations and the same desired camera pose 3.2 M o b i l e B a s e P o s e S t a b i l i z a t i o n :
Consider the same system as in section 3.1 (i.e. the mobile robot with only one actuated pan-axis). Its state, sensory d a t a and control input variables are also as before. In this second approach, we consider the stabilization of the mobile platform to a desired pose with respect to some target. At the same time, we require that the camera tracks the targets, whatever the motion of the platform. The role of the arm is, in this case, to provide an extra d.o.f., which will allow the camera to move independently. One of the approaches developed to solve the point stabilization problem for nonholonomic mobile robots is the use of time-varying state feedback, i.e. control laws that depend explicitly, not only on the state, but also on time, usually in a periodic way. Samson [13] introduced them in the context of the unicycle's point stabilization and raised the issue of the rate of convergence to the desired equilibrium. In this section, we apply techniques of time-varying state feedback, recently developed by Morin and Samson [8], in a visual servoing framework [14]. The problem that we consider is to stabilize the mobile platform to the desired configuration which, without loss of generality, will be chosen to be zero, i.e. X* = (x*, y*, 0*, r = 0. The corresponding visual d a t a ]I* = (Yt*, Y~*, Y~*) can be directly measured by putting the system in the desired configuration or can be easily specified, provided d is also known, along with the target geometry a and b (see figure 2.1). An exponentially stabilizing control is considered for the mobile platform, while a control that keeps the targets foveated is considered for the camera. M o b i l e p l a t f o r m c o n t r o l s y n t h e s i s :. In order to facilitate the synthesis of the controller, we apply a locally diffeomorphic transformation of the states and inputs
Extending Visual Servoing Techniques to Nonholonomic Mobile Robots (Xl,12,13)
T
deal (x,y,
tanO) -r
'
Ul = cos0 v ,
u2 -- - -
1
COS2 0
w
'
113 (3.6)
which brings the unicycle kinematics (eq. 2.8) in the so-called chained form [11], [8]: :rl =
Ul
,
x,2 =
X3Ul
53=u2.
,
(3.7)
The mobile platform control, that can be used if the state is known or reconstructed, is given by: v ( t , x ) - ~1 ul(t,~(X)) , ~(t, x ) = cos ~ e u,(t, ~ ( x ) ) , -
(3.8)
-
where ul and u2 are the time-varying state-feedback controls, developed by Morin and Samson [8] for the 3-dimensional 2-input chained-form system. These controls, given in terms of the chained-form coordinates of equation 3.6, are : ul (t, xl, 12,13) = kx [p3(12,13) + a ( - x l sin wt + Ixl sin wtl) ] sin w t , u2 (t, 11,12, x3) = k p2(x2) J '
where p2(x~) ded Ix~l f, p3(x~, ~3) ded (Ix~l 2 +
1~31~) ~,
(3.9)
w is the frequency of
the time-varying controls and a, kl, k2, k3 are positive gains. The exponential convergence to zero of the closed-loop system can be established using the homogeneous norm p(xl, x2, x3) def (ix116 + ix 212 + i1313)~ . The control L/ for the mobile platform is then
Lt(t, X) = (v(t, X), w(t, X)) T.
(3.10)
Such a control requires an estimate )( of the current state X. This estimate can be provided by state reconstruction from the visual d a t a [14]. However, since we are interested in positioning the mobile robot to the desired configuration X* = 0, while starting relatively close to it, we could attempt to do so without reconstructing its state explicitly. Since Y = ~ ( X ) , the state X can be approximated, near the configuration X* = 0, up to first order by 0~
.
X(Y) = [~--~(X )]
--1
(Y-V*),
(3.11)
where o@ = B2(X)BI(X) with B1 and B2 as specified in 2.12 and 2.13 by setting n = 1. The proposed control law for the mobile platform can thus be expressed as a function of only the sensory data /d = L/(t,Y).
(3.12)
114
Dimitris P. Tsakiris, Patrick Rives, and Claude Samson
A r m c o n t r o l s y n t h e s i s :. In order to implement a vision-based s t a t e feedback control law for the mobile platform, we have to track the target during the motion of the platform. The arm control wr is chosen to keep the targets foveated by regulating the angular deviation of the line--of-sight of the camera from the targets to zero, while the mobile robot moves. It is specified so that Y,n is made to decrease exponentially to Y~*, by regulating the task function e(X) de f Y r n - Y~ to zero and by making the closed-loop system for e behave like ~ = - A e, for a positive gain A. This gives J2,2 \) - Ym,) (3.13) wr (t, X, Y) -- fl2,3
(J2,,
where L72,~ is the ( 2 , / ) - e n t r y of In particular, L7~,3----- f -
(~
the matrix J(X) de=fB2(X) B,(X) B3(X).
+ -~).
The first term of equation 3.13 makes
the arm track the targets, while the term in parenthesis pre~-compensates for the motion of the mobile robot. A useful simplification of this law can be obtained by ignoring this pre-compensation term. E x p e r i m e n t a l R e s u l t s :. This control law has been validated by simulations and real experiments. Our test-bed is a unicycle-type mobile robot carrying a 6 d.o.f, manipulator arm with a CCD camera ([9],[15]). In the experimental results presented below, we use the control law 3.12 with the unicycle controls 3.8, the arm control 3.13 and the state approximation 3.11 by sensory data. The following parameters corresponding to the models developed above are used: ll -- 0.51 m, 12 -- 0.11 m, d -2.95 m, f -- 1 m. The following gains are used for the above control laws: w -- 0.1, kl = 0.25, k2 = 2, k3 -- 100, a = 10, A = 12. The controls 3.9 are normalized to avoid actuator saturation and wheel sliding; this does not affect the exponential stabilization of the system, only its rate. Initial experiments used the raw visual data to calculate the state and the controls. Implementation of such a scheme leads to significant small oscillations and jerks during the motion of the system. To fix this problem, subsequent experiments used Kalman filtering of each of the state variables (x, y, 8). This makes the corresponding trajectories smoother and helps in compensating for the vision-induced delays. No filtering was used on the visual data. The resulting (x, y ) - t r a j e c t o r y as well as the corresponding controls v, w are plotted in figure 3.2. The dotted line represents d a t a obtained by odometry, while the solid one represents d a t a obtained by vision. Each period of the time-varying controls corresponds to 1570 samples (data on the state of the system are recorded every 40 msec). 3.3 S i m u l t a n e o u s M o b i l e B a s e a n d C a m e r a P o s e S t a b i l i z a t i o n : The approaches in sections 3.1 and 3.2 can be seen as complementary. The first one can be used to stabilize the camera to a desired position and orientation with respect to a target, but the final pose of the mobile basis is not
Extending Visual Servoing Techniques to Nonholonomic Mobile Robots (z.y) t r a j e c t o r y
Tlme--varyi
n
Controle
(v.
115 w)
i ?-
F
T,T
J -o+s+
- -
| " "
Odometry
x-~x 9
- -
9
..Ta
Vision
§
m~
Time
kpprozl~ltion
Fig. 3.2. Mobile robot (x, y)-trajectory and controls v, w controlled. T h e second one can stabilize the mobile basis to a desired pose with respect to a target and t r a c k this t a r g e t with the c a m e r a while the robot moves, but it cannot independently stabilize the c a m e r a to a desired position and orientation. W h e n additional d.o.f.s are available in the arm, they can be used to accomplish b o t h goals simultaneously. In this section we consider a mobile robot with a 3-d.o.f. a r m as in fig. 2.1. Our goal is to simultaneous stabilize the mobile basis to a desired pose taken as X* = 0 and the c a m e r a to a desired pose for which the corresponding visual d a t a are Y* d__ef(Yl*, Y*, y * ) T . T h e system state is X = ( x , y , O , r 1 6 2 1 6 2 T, the m e a s u r e m e n t d a t a are Y = (Yl, Ym Yr, r r r T and the controls a r e / 4 -- (v, w, wr , wr , wr T. T h e control components v and w, in charge of stabilizing the mobile platform can be determined as in section 3.2. T h e only difference is the state estimation, which is done by using matrices B1 and B2 t h a t correspond to the present setup (n -- 3). As in section 3.1, the c a m e r a pose stabilization is cast as the problem of regulating to zero the task function output e(X) de--4fYv - Yv* with dim e = 3. From 2.11 and the system kinematics, we get : = ]?v -- B2,1(X) (BI,z(X) B1,2(X))B3(X)/4.
(3.14)
where the matrices B I , I , B1,2, B3 are given in 2.6, 2.7 and 2.15, for n = 3. From 3.14, and since we want the equation @= - A e, to be satisfied for the controlled system, we get
B2,1(X)Bz,z(X)B3,z(X)
(v) + B2,1(X)BI,2(X) r+l ~wr } \wr
/
= - A e.
(3.15)
116
Dimitris P. Tsakiris, Patrick Rives, and Claude Samson
Finally, solving for the arm controls wr wr wr we get, away from singular configurations where B1,2(X) and B2,1 (X) are not invertible, wr j = - A B x,z(X)B~,~(X)(Yv -1 - Y*) - B ~ , ~ ( X ) B I , I ( X ) B 3 , 1 ( X ) \wr I
.
(3.16) As previously, the first term of the above equation makes the arm track the targets, while the second term pre-compensates for the motion of the mobile basis. Notice that det B1,2 = -1112 sinr therefore configurations where it is zero are singular and should be avoided. The validity of this control law has been tested in simulation. The (x, y ) trajectory of the mobile robot is very similar to the one in fig.3.2 and is not shown here. 4. C o n c l u s i o n We presented several approaches to the application of visual servoing techniques to hybrid holonomic/nonholonomic mechanical systems. How appropriate each of these approaches is, depends on the task to be performed and on the mechanical structure of the robot. The first approach, based on output linearization, proved to be robust with respect to modeling errors and measurement noise in both simulations and experiments. For tasks which only involve positioning the camera with respect to the robot's environment (e.g. target tracking, wall following, etc.), this first scheme applies. However, it does not apply anymore when the task explicitly requires stabilizing the nonholonomic platform to a desired pose, like, for example, in a parking maneuver. The second approach involving time-varying feedback techniques is, in this case, better adapted. The use of redundant systems allowing simultaneous stabilization of the camera and the nonholonomic platform brings up some exciting research issues in a large field of applications, like those where the robot has to navigate in highly constrained environments (e.g. nuclear plants or mine fields). The results presented here are however preliminary and their experimental evaluation is currently in progress using the test-bed described above. In particular, several theoretical and experimental issues need to be addressed concerning the robustness of such control schemes. References 1. R.W. Brockett, "Asymptotic Stability and Feedback Stabilization", in Differential Geometric Control Theory, Eds. R.W. Brockett, R.S. Millman and H.J. Sussmann, Birkhauser, Boston, 1983. 2. J.P. Laumond and al., "Robot Motion Planning and Control", Ed. J.P. Laumond, Lecture Notes in Control and Information Sciences, 229, Springer Verlag, 1997. 3. F. Chaumette, La relation vision-commande: thdorie et applications g~des t~ches robotiques, Ph.D. Thesis, University of Rennes I, France, July 1990.
Extending Visual Servoing Techniques to Nonholonomic Mobile Robots
117
4. B. Espian, F. Chaumette and P. Rives, "A New Approach to Visual Servoing in Robotics", I E E E Trans. on Robotics and Automation 8,313-326, 1992. 5. G.D. Hager and S. Hutchinson, Eds., "Vision-based Control of Robotic Manipulators", Special section of I E E E Trans. Robotics and Automation 12, 649-774, 1996. 6. K. Hashimoto, Ed., Visual Servoing, World Scientific, 1993. 7. B.K.P. Horn, Robot Vision, Mc Graw-Hill, 1986. 8. P. Morin and C. Samson, "Application of Backstepping Techniques to the TimeVarying Exponential Stabilization of Chained Form Systems", INRIA Research Report No. 2792, Sophia-Antipolis, 1996 9. R. Pissard-Gibollet and P. Rives, "Applying Visual Servoing Techniques to Control a Mobile Hand-Eye System", I E E E Intl. Conf. on Robotics and Automation, 1995. 10. C. Samson, M. Le Borgne and B. Espiau, Robot Control: The Task Function Approach, Oxford University Press, 1991. 11. J.-B. Pomet and C. Samson, "Time-Varying Exponential Stabilization of Nonholonomic Systems in Power Form", INRIA Research Report No. 2126, SophiaAntipolis, 1993. 12. P. Rives, R. Pissard-GiboUet and L. Pelletier, "Sensor-based Tasks: From the Specification to the Control Aspects", The 6th Intl. Symposium on Robotics and Manufacturing, Montpellier, France, May 28-30, 1996. 13. C. Samson, "Velocity and Torque Feedback Control of a Nonholonomic Cart", in Advanced Robot Control, Ed. C. Canudas de Wit, Lecture Notes in Control and Information Sciences, No. 162, Springer-Verlag, 1990. 14. D.P. Tsakiris, C. Samson and P. Rives, "Vision-based Time-varying Mobile Robot Control", Final European Robotics Network (ERNET) Workshop, Darmstadt, Germany, September 9-10, 1996. Published in Advances in Robotics: The ERNETPerspective, Eds. C. Bonivento, C. Melchiorri and H. Tolle, pp. 163-172, World Scientific Publishing Co., 1996. 15. D.P. Tsakiris, K. Kapellos, C. Samson, P. Rives and J.-J. Borrelly, "Experiments in Real-time Vision-based Point Stabilization of a Nonholonomic Mobile Manipulator", Preprints of the Fifth International Symposium on Experimental Robotics (ISER'97), pp. 463-474, Barcelona, Spain, June 15-18, 1997. 16. D.P. Tsakiris, P. Rives and C. Samson, "Applying Visual Servoing Techniques to Control Nonholonomic Mobile Robots", Workshop on "New Trends in Imagebased Robot Servoing", International Conference on Intelligent Robots and Systems (IROS'97), pp. 21-32, Grenoble, France, September 8-12, 1997.
A Lagrangian Formulation of N o n h o l o n o m i c Path Following Ruggero Frezza 1, Giorgio Picci 1,2, and Stefano Soatto 3,4 1 2 3 4
Universits di Padova, via Gradenigo 6a, 35100 Padova - Italy Consiglio Nazionale delle Ricerche, Padova - Italy Washington University, St. Louis, MO 63130 USA Universits di Udine, via delle Scienze, 33100 Udine - Italy
S u m m a r y . We address the problem of following an unknown planar contour with a nonholonomic vehicle based on visual feedback. The control task is to keep a point of the vehicle as close as possible to the contour for a choice of norm. A camera mounted on-board the vehicle provides measurements of the contour. We formulate the problem and compute the control law in a moving reference frame modeling the evolution of the contour as seen by an observer sitting on the vehicle. The result is an on-line path planning strategy and a predictive control law which leads the vehicle to land softly on the unknown curve. Depending on the choice of the tracking criterion, the controller can exhibit non-trivial behaviors including automatic maneuvering.
1. I n t r o d u c t i o n In this paper we consider the problem of tracking an unknown contour by a nonholonomic vehicle, using visual feedback. This is a fundamental problem in autonomous navigation. The contour to be followed may be the b o u n d a r y of some unknown obstacle or one of the borders of an unknown road which the vehicle should follow. An on-board camera provides measurements of the contour and the control should primarily be based on information on the contour coming from the vision system. Following the basic paradigms of system and control theory, we design a feedback control action based on a local estimate of the contour obtained from video measurements of some feature points of the unknown path, tracked on the image plane. The estimate must be continuously u p d a t e d based on both the current measurements and on some a priori mathematical model of how the contour seen by the moving camera changes in time. The design of real-time tracking strategies of unknown curves brings up new problems in control. One such problem is on-line path planning, i.e. the design of an optimal connecting contour to the curve being followed, depending both on the current state of the vehicle and on the local shape of the contour. The connecting contour must also satisfy the geometric and kinematical constraints of the navigation system. On-line path planning is a new problem typical of autonomous navigation. We shall discuss a simple solution to this problem in a two-dimensional setup, in section 2. of this paper.
A Lagrangian Formulation of Nonholonomic Path Following
119
Although a considerable amount of literature has appeared on trajectory tracking by nonholonomic vehicle, very little is available on the problem of both estimating the contour and tracking it. The pioneering work was done by Dickmanns and his group [2, 3, 4] but general models for contour estimation are discussed for the first time in [7]. For low operational speed we can neglect inertias and model the car kinematically. In this setting the vehicle obeys a nonholonomic dynamics of the fiat type [6]. This greatly facilitates the design of (open loop) path following controls, provided the assigned path is specified in the so-called "flat outputs" space. The idea here is to formulate the tracking problem as a constrained approximation of the desired path with feasible trajectories of the vehicle. The result is a novel control scheme in which estimation and control are mixed together. This paper evolves on preliminary results presented in [7, 8, 9]. 1.1 S i m p l e m o d e l o f a v e h i c l e f o l l o w i n g a n u n k n o w n
contour
The simplest kinematic model of a vehicle is a wheel that rolls without slipping. In this paper we only consider planar roads, which we represent in an inertial reference frame {0, X, Y } as parametrized curves
Fo = { ( X ( s ) , Y ( s ) ) e l:[ 2, s E [O,S] C ~ }
(1.1)
where s is some curve parameter, for instance arc-length. We will assume that F is of class at least C 1, i.e. that it is continuous along with its tangent. A wheel rolling without slipping can be represented as a moving frame {o, x, y} that rotates about the normal to the road-plane at a rate w (rad/s), but can only translate along one independent direction. Without loss of generality we let the direction of translation coincide with the x - a x i s , so that the instantaneous translational velocity of the vehicle is represented as iv 0] T with v the longitudinal speed (m/s). Such a restriction on the velocity of the wheel does not impose limitations on the positions it can reach. Constraints on the velocity of a system that cannot be integrated into constraints on position are called non-holonomic; there is a vast literature on controllability, stabilization and path planning for systems with non-holonomic constraints [11, 12, 14, 13]. In the moving frame, the road is represented as a contour F(t) t h a t changes over time under the action of the motion of the vehicle: F(t) -{(x(l,t),y(l,t)) E ~:t2, l E [0, L] C ~ } . In order to simplify the representation, we will assume that - locally at time t - F ( t ) satisfies the conditions of the implicit function theorem, so that we can let x(l, t) = 1 V t and l C [0, L]. Consequently the contour can be represented as a function y = 7(x, t) x e [0, L].
(1.2)
Such a representation breaks down, for instance, when the road winds-up or self-intersects or when the vehicle is oriented orthogonal to it (see figure 1.1).
120
Ruggero Frezza, Giorgio Picci, and Stefano Soatto
Y
L
x Fig. 1.1. An elementary model of a vehicle .following an unknown contour. 1.2 L o c a l e v o l u t i o n o f t h e c o n t o u r
We take a Lagrangian viewpoint by considering the evolution of the contour seen from the moving vehicle. A point which is stationary in inertial coordinates has coordinates (x, y) that evolve according to
{
~ = -wy + v y=wx
(19
In particular, points on the contour, that is pairs of the form ( x , y ) (x, 7(x, t)), evolve according to =
=
v) +
OV
--
(1.4)
The above is a partial differential equation that can be interpreted as governing the evolution of the surface {7(x, t), x C [0, L], t E [0, t f]. 1.3 M e a s u r e m e n t
process
When we drive a car our visual system measures the perspective projection of the 3-D world onto a 2-D surface, such as our retina or the CCD surface of a video-camera. We model such a projection as a plane projective transformation from the road plane to the retinal plane9 We choose a camera re/erence-frame centered in the center of projection, with the x - a x i s orthogonal to the retinal plane. For the sake of simplicity, we consider the optical center of the camera to coincide with the center of the wheel. W h a t we can measure is then the perspective projection [y]
)=
Y+n
(1.5)
A Lagrangian Formulation of Nonholonomic Path Following
121
up to a white, zero-mean Gaussian noise n. In practice it is c o m p u t a t i o n a l l y prohibitive to measure the projection of the whole contour. Instead, it is more convenient to process regions of interest and localize the position of the projection of the contour at a few, controlled locations on the image:
{~
7(x,t)
x e [xl,...,xN]}.
(1.6)
Note t h a t the positions xi can be considered control p a r a m e t e r s , t h a t can therefore be chosen according to some optimality criterion. If we measure the images of a few corresponding points on the road plane seen from different viewpoints, it is quite easy to recover the projective transformation induced by the perspective projection of the road onto the c a m e r a (see for instance [5]). Therefore, in the remainder of the p a p e r we will assume t h a t we can measure directly pairs of coordinates (xi,v(xi,t)) i = 1...g on the road-plane from the image coordinates. If we couple equation 1.4 with the measurements, we end up with a distributed dynamical system: 0f - -~x
-
o~
(~7(x, t) - v)
y~(t) = 7 ( x i , t ) + n i ( t )
(1.7)
i = l...N.
Our goal is t h a t of using the inputs v, w to control the evolution of 7(x, t) in order to drive the vehicle along the contour. Towards this goal, we consider a local representation of the contour described as follows.
1.4 Local r e p r e s e n t a t i o n o f t h e m o v i n g c o n t o u r Consider a local representation of the contour around the point x = 0 via the moments ~1 (t) - ~(0, t) ~(t)
- ~(0,t)
(1.8)
027 (0, t) ~3 (t) - b-~z~ ` :--: T h e first two variables ~1 and ~2 encode a notion of "relative pose" between the vehicle and the contour. In particular ~1 could be interpreted as an approximation of the distance from the vehicle to the contour, and ~2 as the relative orientation between the two (of course the a p p r o x i m a t i o n becomes more accurate as the vehicle gets closer to parallel to the tangent to the
122
Ruggero Frezza, Giorgio Picci, and Stefano Soatto
contour at x = 0). The terms ~k k > 2 encode curvature and higher terms, which characterize the "shape" of the contour, an invariant p r o p e r t y of the Euclidean plane. Such invariance of shape is somewhat hidden in this representation for it is distributed a m o n g all moments. It is easy to derive the time evolution of the moments: just substitute the above definitions into the dynamics of the contour in the viewer's reference (1.7):
41 = ~2 (?2 -- Cd~l ) 42 =
3(v -
-
+ 1)
43 : ~4 (V -- 02~1 ) -- 3~2~3W
(1.9)
44 = ~ (v - w~l) - 4W~l ~4 - 3w~2
The chain of derivatives does not close in general. It does, however, when the contour F can be c o m p u t e d as the solution of a finite-dimensional differential equation with a p p r o p r i a t e b o u n d a r y values. An instance of this case is the case of linear-curvature curves, which has been studied in detail by M a and al. [10]. In the interest of generality, we do not wish to impose constraints on the contours to be followed. We will see how this results in control laws t h a t exhibit non-trivial behaviors, such as automatic maneuvering.
2. T r a c k i n g as a n a p p r o x i m a t i o n t a s k W h a t does it mean for a vehicle to track a given contour? We certainly would like t h a t the t r a j e c t o r y followed by the vehicle be "close" in some sense to the target contour while being "feasible", i.e. satisfying the kinematic (and possibly dynamic) constraints. Therefore it seems reasonable to pose the tracking problem as an approximation task where we choose a m o n g the feasible trajectories the one t h a t best approximates the given contour for a given choice of norm. While the class of feasible trajectories depends only upon the kinematic (and dynamic) constraints of the vehicle and can therefore be pre-computed, the target contour is not known a-priori, but it is r a t h e r estimated on-line in a causal fashion, and such an estimate is subject to uncertainty. Therefore, it seems unavoidable t h a t the control strategy should be u p d a t e d in response to new measurements t h a t add information (i.e. decrease uncertainty) on the target contour. In this section we will present a novel control s t r a t e g y t h a t captures this mixed feed-forward and feed-back nature. Before doing that, as a further motivation, we discuss the limitations of a simple controller based upon feedback linearization.
A Lagrangian Formulation of Nonholonomic Path Following
123
2.1 C o n v e n t i o n a l c o n t r o l via feedback line.arization Consider the relative pose variables ~1 and ~2 defined in (1.9); they evolve according to the first two components of the differential equations in 1.10), which are of the form
One could now solve (2.1) for [w,v] and assign to ~1 and (2 any desired dynamics. For instance, one could regulate ~1 and (2 to zero exponentially by imposing
where ~ and ~ are positive real numbers. Such a choice would result in the following feedback control law
Such a method, however, cannot result in a practical control law since it demands exact knowledge of the shape of the contour at all times. In fact, the control law depends on (3 which encodes the local curvature of the contour. While measuring curvature from visual input is extremely sensitive to noise in the m e a s u r e m e n t s "~(xi, t), one could augment the state to include (3. But then, according to (1.10), the dynamics depends on (4, which is unknown, and so on. One possible way to overcome this problem is to restrict the attention to classes of target contours t h a t generate a finite-dimensional model (1.10), as done in [10] for linear-curvature curves. However, we do not want to impose restrictions on the contours t h a t can be tracked, which leads us to the novel control strategy described in section 2.3. 2.2 F e e d f o r w a r d action: p l a n n i n g a c o n n e c t i n g c o n t o u r In this section we wish to give the reader some intuition on the reasons t h a t motivate the control law which we introduce in section 2.3. Let us pretend for a m o m e n t t h a t our vehicle was on the target contour and oriented along its tangent, so t h a t ~1 --- ~2 = 0. Then it would be immediate to write an exact tracking control law. In fact, from (1.10) one sees t h a t choosing w -- v~3 causes ~i(t) = 0 V t. Therefore, in the peculiar case in which the vehicle is already on the target contour and heading along its tangent, a control proportional to its curvature, namely
~(t)
=
027.
v~-~x2(0, t )
(2.4)
124
Ruggero Frezza, Giorgio Picci, and Stefano Soatto
ro Y
Fig. 2.1. Exact tracking control: if the vehicle is on the contour, oriented along the tangent, a control proportional to the curvature of the target contour can achieve perfect tracking. The control is, however, unfeasible because of uncertainty. is sufficient to maintain the vehicle on the contourat all times (see figure 2.1). Needless to say, we cannot count on the vehicle ever being exactly on the contour. However, causality and the non-holonomic constraint imposes t h a t any feasible t r a j e c t o r y must go through the current position of the vehicle, and it must have its tangent oriented along its x-axis. Since the target contour is known only locally through the m e a s u r e m e n t s V ( x i , t ) , i = 1 . . . N , one could imagine an "approximation" to the target contour which, in addition to fitting the measurements ~'(xi,t), also satisfies the two additional nonholonomic constraints (see figure 2.2). We call such a p p r o x i m a t i n g t r a j e c t o r y a "connecting" contour 1. For the case of a wheel, the connecting contour would start at the current position of the vehicle with the tangent pointing along the x - d i r e c t i o n , and end at a point (Xc, V(xc, 0)) on the contour with the same tangent. Overall the connecting contour Vx must satisfy the minimal set of conditions: c(0) = 0 c(xc) = (Xc,0)
= 0
(2.5)
=
The simplest curve t h a t satisfies the four above conditions is a polynomial of degree 3. Now, one m a y think of the composition of the connecting contour % with the target contour 7 as a new target contour. By construction, the vehicle is 1 The choice of the connecting contour depends upon the differentially flat structure of the system. A connecting contour for a flat system of order p (i.e. the flat outputs need to be differentiated p times to recover the state) must satisfy p causality conditions and be of class C p- 1. For example, the connecting contour for a vehicle with M trailers must satisfy at least M + 2 causality conditions.
A Lagrangian Formulation of Nonholonomic Path Following
125
t
ro
Y
Fig. 2.2. Planning a connecting contour. For a trajectory to be feasible, it must pass through the origin of the moving plane (current position of the vehicle) and it must be oriented along the x axis. A connecting contour is an approximation to the target contour that simultaneously satisfies the feasibility constraints. on such a contour, and oriented along its tangent. Therefore, one may hope to be able to apply the exact tracking controller (2.4), where the curvature is that of the connecting contour (see figure 2.2):
w(t) - v ~
(0, t)
(2.6)
This strategy is bound to failure for several reasons. First the composite contour is not a feasible path for the vehicle since continuity of the secondderivative (and therefore of the control) is not guaranteed at xc. Second, while the connecting contour is being planned, the vehicle may have moved, so that the initial conditions (2.5) are violated. More in general, the controller should be updated in response to added knowledge about the contour whenever a new measurement comes in, aking to a feedback control action. These considerations leads us into the next section, where we introduce the control law. 2.3 F e e d b a c k : u p d a t i n g t h e c o n t r o l a c t i o n Suppose at a certain time t we plan a connecting contour "yc and act with a controller (2.6) as described in the previous section. At the next time step t + At a new measurement of the target contour becomes available, while the control action specified at time t has moved the vehicle along the connecting contour. In the absence of noise and uncertainty, such a control is guaranteed to track the connecting contour with no error. However, due to uncertainty, noise and delays in the computation, at time t + At the vehicle is n o t going
126
Ruggero Frezza, Giorgio Picci, and Stefano Soatto
to be exactly on the connecting contour planned at t, which therefore ceases to be a feasible trajectory. The idea, then, is to simply plan a new connecting contour. In a continuous framework (when ~t ~ 0), this strategy results in a control action that effectively moves the vehicle along the envelope of connecting contours (see figure 2.3). Therefore, the effective connecting contour now ceases to be rigid, for its shape is updated at each instant in response to the information provided by the new measurements of the target contour. In particular, % no longer satisfies (1.7), nor does Xc satisfy (1.3). Notice also that this controller is specified entirely in the moving frame, without knowledge of the inertial shape of the contour, thereby the name "Lagrangian" given to this control technique.
ro
Y
X
Fig. 2.3. I n the feedback control law proposed in section 2.3, the vehicle m o v e s along the envelope o f connecting contours.
We illustrate this concept on the simple case of the wheel, although the controller can be generalized to more complex models, as we discuss in section 4.. In the moving frame (x, y) the contour q,(x, t) evolves according to (1.7), while the connecting contour Vc must satisfy instantaneously the minimal set of conditions (2.5). Therefore, the simplest connecting contour has the form
7c(X, t) = c~(t)x3 + fl(t)x 2
(2.7)
where =
Xc
Z(t) = 3 "Y(xc' t) X2
'(xo,O + - X2
(2.s)
V'(xc, t)
(2.9)
Xc
A Lagrangian Formulation of Nonholonomic Path Following
127
The control we propose is proportional to the curvature of the (instantaneous) connecting contour at the origin:
w(t) "- v(t) 02% (0, t) = 2 vx2 (3V(xc,t)
-
xcv'(xc,t))
(2.10)
Note that, once the above control is specified, the trajectory depends upon the nature of the contour 7, the longitudinal velocity v and the distance of the connecting point Xc. The latter two are additional control parameters that can be chosen for performance.
3. A n a l y s i s 3.1 L o c a l s t a b i l i t y
To study stability, we expand the control law (2.10) in the local variables ~i w(t) = v(6~1~ - - ~ ) -b 4 ~2(t) -b ~3(t) + O(xc2)) Xc
(3.1)
Xc
where O(x~) are terms of order greater or equal to x 2c. Substituting (3.1) in (2.1), we obtain the following dynamics
{41 = v(~2 + hi (~1, ~2,..., ~i,...)) =
-
+ h2(
(3.2)
1,
While the poles of the linear part are stable, by itself this is not sufficient to guarantee local stability. However, the linearization about ~1 = 0 and ~2 = 0 is -
~-
4 v
(3.3)
since the O(xc2) terms can be made arbitrarily smM1 adjusting xc, both terms in the second row are positive which implies that linearized system is stable and, therefore, the nonlinear system is locally stable. Dividing both equations (3.2) by v, it is clear that the linear part implies asymptotic convergence with respect to arc-length. The location of the poles in the complex plane is a function of the look-ahead distance Xc, in particular, the magnitude of the poles is inversely proportional to Xc while, surprisingly, the damping does not depend on xc. A small Xc means faster convergence, but it implies large actuation efforts, as it can be seen in (3.1), and a short predictive horizon.
128
Ruggero Frezza, Giorgio Picci, and Stefano Soatto
4. C h o i c e
of the norm:
automatic
maneuvering
The control strategy we have proposed in the previous section relies on the property that differentially flat systems have of being path invertible [6]. W h a t it means is that assigning a control law is equivalent to assigning a path in the flat outputs, since the controls can be computed as a function of the flat outputs and their derivatives. This is what makes possible to formulate tracking as a constrained approximation problem. The potentials of this idea do not fully come through in the simple example of controlling the trajectory of a single wheel. In this section we consider a slightly more articulated model, t h a t is a bi-cycle, and show that the control generates interesting and non-trivial trajectories. The bi-cycle is a slightly more realistic model of a car than just a wheel: a moving frame centered on the mid-point of the rear axle of the car satisfies the following kinematic model { i/'(t) - vrt ~ [cos(8)] T(0) = 0 ~ / [sin(8)J (4.1) ~(t) = ~ tan(j3(t)) 8(0) = 0 -
-
= u(t)
Z(o) = o
here ~ is the steering angle of the front wheels and l is the distance between the front and the rear axles. The model (4.1) is said to be fiat of order 2, since the flat outputs T(t) must be differentiated twice in order to get the whole state, ~ and/~ in particular. In order to be a feasible trajectory, the connecting contour 7c defined in section 2.2 must satisfy the following minimal set of conditions:
{
%(0, t) = 0 V'c(0,t)i= 0 t) = (xc, t)
~/~l(O,t)=tan(~(t))/l t) = t) t) =
Vt. t)
(4.2) Real vehicles often have restrictions on the steering angle, for instance I~1 -< B < r / 2 . This makes things more complicated, for feasible trajectories must have a curvature bounded by tan(B)/l. Therefore, tracking an arbitrary contour with no error is impossible unless such a contour has a curvature less than tan(B)~1 everywhere. It is possible, however, to minimize the tracking error by acting on the longitudinal velocity v, which up to now we have not exploited. In particular, allowing a reversal of v makes it possible to generate singular connecting contours, with cusps corresponding to v = 0. A cusp in the trajectory corresponds to the location where che vehicle performs a maneuver by stopping and reversing its direction. From this viewpoint, B-splines represent a desirable class of connecting contours since they can handle cusps in a natural way when two nodes coincide [1]. The controller, therefore, is left with the duty of choosing the look-ahead distance Xc so that the constraint on the steering angle is satisfied. Since we have no choice than accepting tracking errors, a natural question arises of what is the correct norm to minimize. Trying to minimize the
A Lagranglan F o r m u l a t i o n o f N o n h o l o n o m i c P a t h F o l l o w i n g
129
~l.$raoot
~s 5 4
0
I
O2
0"18 0.1I6 014 g. ol
o04 0o2
Fig. 4.1. When there are limits on the m a x i m u m steering angle, not all contours can be followed with no error. Allowing the longitudinal speed v to reverse, and minimizing the c~-norm of the distance from the target trajectory, one sees the controller automatically generating maneuvers. The number o / m a n e u v e r s increases with the curvature of the target contour. In the simulation experiment, the target trajectory (dotted line) is a spiral, and the trajectory of the vehicle (solid line) starts maneuvering when the curvature of the contour exceeds the limits on the steering angle.
s norm of the tracking error may not seem natural, since the best approximating trajectory may contain many cusps implying lots of maneuvering. It seems more natural to the problem of car-driving to keep the tracking error bounded and satisfy the task of staying within a lane centered about the unknown contour. The problem is, then, finding an w oo approximation of the observed portion of unknown contour with feasible trajectories. In figure 4.1 we show the results of a simulation of a bi-cycle trying to follow a spiral trajectory. Before the vehicle hits the maximum steering angle, the trajectory is followed without maneuvers. However, when the curvature of the target path exceeds the one allowed by the actuators, the vehicle starts
130
Ruggero Frezza, Giorgio Picci, and Stefano Soatto
vehk~
tra~,eacy
I f
3
4 xlml
5
1 ~/ o.8 0.7
o.ol
Fig. 4.2. When the target contour has singularities, one cannot achieve perfect tracking even in the absence of uncertainty. In an L ~ framework, one can impose that the trajectory remains within a specified distance from the target. In this figure the distance is required to be below l m , which can be satisfied without hitting the steering limit.
maneuvering at a frequency t h a t increases with the curvature. Note t h a t the control performs this "decision" autonomously, without the need to switch to a different control strategy. In figure 4.2 we show the results of a simulation of a car following a square while maintaining the tracking error below l m . T h e constraints on the steering angle make it possible to satisfy such bounds without requiring maneuvers. If, however, we d e m a n d t h a t the t r a j e c t o r y be followed with an error of less t h a n 0.3m, then the controller is forced to m a n e u v e r (figure 4.3). In figure 4.4 we show the results of an experiment performed to validate the results just described on a real experiment, where a toy car is controlled so as to follow a t r a j e c t o r y with cusps. As it can be seen, despite the rudiment a r y model used for the kinematics and the lack of a dynamical model, the
A Lagrangian Formulation of N o n h o l o n o m i c P a t h Following
1
2
3
4
5
8
131
7
x [m] &'a~clory ~ o f
1
O~ ~., o.7
~
0.5
O.3 0.2
0"11
~
loo
9 2OO
ao0
4OO
500
eO0
700
8OO
~
1000
Fig. 4.3. When the bounds on the tracking error are small and there are bounds on the steering angle, it is necessary to maneuver. Modeling the connecting contour using B-splines, and allowing the reversal of the translational velocity, one sees the controller generating maneuvers that maintain the trajectory within the specified bounds.
vehicle exhibits a behavior qualitatively similar to the simulations. Since the admissible tracking error is 0.2m, the vehicle is forced to multiple maneuvers around the corners.
5. Conclusions We have presented a novel control strategy for a non-holonomic vehicle to track an arbitrary contour. Such a control exhibits non-trivial behaviors such as maneuvering, and can track to a specified degree of accuracy an arbitrary path. While the proposed strategy has proven promising in several experiments and simulations, its potentials for generalization to wider classes of nonlinear
132
Ruggero Frezza, Giorgio Picci, and Stefano Soatto
os 04
O2
oI
-ol ~6
3
35
4 x [m]
09f
o.a
o.7
lOO
20o
300
400
500
60o
700
800
goo
lO~O
ema
Fig. 4.4. Experiment with a real vehicle following a target contour with singularities. Despite the rudimentary model adopted, the vehicle exhibits a performance qualitatively similar to the simulations, with automatic maneuvers to satisfy the bounds on the tracking errors.
control systems is still being investigated; the analysis of the stability, controllability and performance of such control strategies is still in its infancy. We have shown a stability analysis for the simple case of a rolling wheel.
References 1. C. De Boor, "A Practical Guide to Splines", Springer-Verlag, 1978. 2. E. D. Dickmanns and V. Graefe, "Dynamic monocular machine vision", Machine Vision and Application, vol. 1, pp. 223-240, 1988. 3. E. D. Dickmanns and V. Graefe, "Applications of dynamic monocular machine vision", Machine Vision and Application, vol. 1, pp. 241-261, 1988. 4. E. D. Dickmanns and B. D. Mysliwetz, "Recursive 3-d road and relative egostate estimation". IEEE Transactions on PAMI, 14(2): pp. 199-213, Feb. 1992. 5. O. Faugeras, "Three-dimensional vision: a geometric viewpoint". MIT Press, 1993.
A L a g r a n g i a n F o r m u l a t i o n o f N o n h o l o n o m i c P a t h Following
133
6. M. Fliess, J. L~vine, P. Martin and P. Rouchon, "Design of trajectory stabilizing feedback for driftless flat systems", In Proceedings Int. Conf. ECC'95, pp. 18821887, Rome, Italy, Sep. 1995. 7. R. Frezza and G. Picci, "On line path following by recursive spline updating", In Proceedings of the 34th IEEE Conference on Decision and Control, vol. 4, pp. 4047-4052, 1995. 8. R. Frezza and S. Soatto, "Autonomous navigation by controlling shape", Communication presented at MTNS'96, St. Louis, June 1996. 9. R. Frezza, S. Soatto and G. Picci, "Visual path following by recursive spline updating". In Proceedings of the 36th IEEE Conference on Decision and Control, San Diego, CA, Dec. 1997. 10. Yi Ma, J. Kosecka and S. Sastry, "Vision guided navigation for nonholonomic mobile robot". To appear in Proceedings of the 36th IEEE Conference on Decision and Control, San Diego, CA, Dec. 1997. 11. R. Murray and S. Sastry, "Nonholonomic motion planning: steering using sinusoids", IEEE Transactions on Automatic Control, 38 (5), pp. 700-716, May, 1993. 12. R. Murray, Z. Li and S. Sastry, A Mathematical Introduction to Robotic Manipulation. CRC Press Inc., 1994. 13. C. Samson, M. Le Borgne and B. Espiau, Robot Control The Task Function Approach. Oxford Engineering Science Series. Clarendon Press, 1991. 14. G. Walsh, D. Tilbury, S. Sastry and J. P. Laumond, "Stabilization of trajectories for systems with nonholonomic constraints". IEEE Transactions on Automatic Control, 39(1): pp. 216-222, Jan. 1994.
V i s i o n G u i d e d N a v i g a t i o n for a N o n h o l o n o m i c Mobile Robot Yi Ma, Jana Kovseck~, and Shankar Sastry Electronics Research Laboratory University of California at Berkeley Berkeley, CA 94720 USA
1. I n t r o d u c t i o n This contribution addresses the navigation task for a nonholonomic mobile robot tracking an arbitrarily shaped ground curve using vision sensor. We characterize the types of control schemes which can be achieved using only the quantities directly measurable in the image plane. The tracking problem is then formulated as one of controlling the shape of the curve in the image plane. We study the controllabity of the system characterizing the dynamics of the image curve and show that the shape of the curve is controllable only upto its "linear" curvature parameters. We present stabilizing control laws for tracking both piecewise analytic curves and arbitrary curves. The observability of the curve dynamics is studied and an extened Kalman filter is proposed to dynamically estimate the image quantities needed for feedback controls. We concentrate on the kinematic models of the mobile base and comment on the applicability of the developed techniques for dynamic car models. Control for steering along a curved road directly using the measurement of the projection of the road tangent and it's optical flow has been previously considered by Raviv and Herman [9]. Stability and robustness issues have not been addressed, and no statements have been made as to what extent these cues are sufficient for general road scenarios. A visual servoing framework proposed in [2] addresses the control issues directly in the image plane and outlines the dynamics of certain simple geometric primitives (e.g. points, lines, circles). The curve tracking and estimation problem originally outlined in Dickmanns [1], has been generalized for arbitrarily shaped curves addressing both the estimation of the shape parameters as well as control in [3] by Frezza and Picci. They used an approximation of an arbitrary curve by a spline, and proposed a scheme for recursive estimation of shape parameters of the curve, and designed control laws for tracking the curve. For a theoretical treatment of the image based curve tracking problem, the understanding of the dynamics of the image of an arbitrary ground curve is crucial.
Vision Guided Navigation for a Nonholonomic Mobile Robot 2. C u r v e
135
Dynamics
In this section we derive image curve dynamics under the motion of a groundbased mobile robot. In the following, only the unicycle model is studied in detail. We will later comment on generalization to other mobile robot models. Let PIr, = (x, y, z) T E IR3 be the position vector of the origin of the mobile frame Fm (attached to the unicycle) from the origin of a fixed spatial frame FI, and O E lR be the rotation angle of Frn with respect to FI, defined in the counter-clockwise sense about the y-axis, as shown in Figure 2.1. For
Ff"T
z
Fig. 2.1. Model of the unicycle mobile robot. unicycle kinematics, one has:
,
ps-~ =
0 = w
(2.1)
cos0]
where the steering input w controls the angular velocity; the driving input v controls the linear velocity along the direction of the wheel. A camera with a unit focal length is mounted on the mobile robot facing downward with a tilt angle r > 0 and elevated above the ground by a distance d, as shown in Figure 2.2. The camera coordinate frame Fe is such that the z-axis of Fe is the optical axis of the camera, the x-axis of Fc is that of F,~, and the optical center of the camera coincides with the origins of Frn and Ft. From (2.1), the velocity of a point q attached to the camera frame Fe is given in the (instantaneous) camera frame by:
(i)
=
s i ne
v+ |
\cosr
\
-xsinr
w.
(2.2)
-xcosr
In order to simplify the notation, we use the abbreviations sr cr ctr and tr to represent sin r cos r cot r and tan r respectively.
136
Yi Ma, Jana Kovseck~, and Shankar Sastry
i/Y
Fm ,, / Fc I/
/ I m a g e Plane / z=l
Fig. 2.2. The side-view of the unicycle mobile robot with a camera facing downward.
For the rest of this paper, unless otherwise stated, we make the following assumptions: 1. the given ground curve E is analytic; 2. the ground curve 1" is such that it can be parameterized by y in the camera coordinate frame Ft. Assumption 1 means that 1" can be locally expressed by its convergent Taylor series expansion. Assumption 2 guarantees that the task of tracking the curve 1" can be solved using a smooth control law, since it avoids the critical case that the curve is orthogonal to the heading of the mobile robot. According to Assumption 2, at any time t, the curve 1" can be expressed in the camera coordinate frame as (~%(y, t), y, ~z(Y, t)) T E ]a 3. Since F is a planar curve on the ground, "Yz(Y, t) is given by: "Yz(Y, t) = d+v sincos r r which is a function of only y. Thus only ~/~(y, t) changes with time and determines the dynamics of the ground curve. For the image curves, people usually consider two types of projections: orthographic or perspective projection. It can be shown that in the above setting, as long as the tilt angle r > 0, there is a simple diffeomorphism between these two types of projection images (for a detailed proof and an explicit expression for the diffeomorphic transformation see [6]). Consequently, the dynamics of the orthographic projection image curve and t h a t of the perspective one are algebraically equivalent. Further on we will use the orthographic projection to study our problem. The orthographic projection image curve of F on the image plane z = 1 is given by (Tx(y,t),y) T E lR2, denoted by/~, as shown in Figure 2.3. We define: ~i+l ~- OiT~(Y't) e ]It, Oyi
~ ---- (~1,~2,.
9
9 ,~i) T E ]a i,
~ ~-- ( ~ 1 , ~ 2 , . . ) T .
e ]R ~176
Since ~/,(y,t) is an analytic function of y, 7x(y,t) is completely determined by the vector ~ evaluated at any y.
Vision Guided Navigation for a Nonholonomic Mobile Robot
137
Y
z
z=l
x
F~ Fig. 2.3. The orthographic projection of a ground curve on the image plane. Here sol = 7z and ~2 = 00-~-. 2.1 Dynamics
of General
Curves
While the mobile robot moves, a point attached to the spatial frame F I moves in the opposite direction relative to the camera frame Ft. Thus, from (2.2), for points on the ground curve F = (Tx(Y, t), y, 7z(y)) T, we have: (2.3)
~/, (y, t) = - (y sin r + 7z cos r Also, by chain rule:
-~- +
(2.4)
( - ( v s r - 7x~sr
The shape of the orthographic projection image curve then evolves in the image plane according to the following Riccati-type partial differential equation
[3]: 07x _ ot
(~sr + 7zcr
+
(vsr -
7~sr
(2.5)
Using the notation ~ and the expression for %, this partial differential equation can be transformed to an infinite-dimensional dynamic system ~ through differentiating equation (2.5) with respect to y repeatedly: (2.6)
=/1~ + Av where f l E IR~ and ]2 E ]R~176 are: (~1~2sr + d c t r +
I 6sr 6sr ~sr
6 6 s r + ~sr + ~;z fl=--
:
~1~iq-18r "~- gi
,
~=
'
(2.7)
138
Yi Ma, Jana Kovseck~, and Shankar Sastry
and gi are appropriate (polynomial) functions of only ~2,---, ~i. C o m m e n t s It may be argued that the projective or orthographic projections induce a diffeomorphism (so-called homography, in the vision literature (see for example Weber et al [10])) between the ground plane and the image plane. Thus, we could write an equation for the dynamics of the mobile robot following a curve in the coordinate frame of the ground plane instead of the image plane. These could be equivalent to the curve dynamics described in the image plane through the push forward of the homography. We have not taken this point of view for reasons that we explain in Section 3.. While in the general case system (2.7) is infinite-dimensional, for a special case of a linear curvature curve (i.e. the derivative of its curvature k(s) with respect to the arc-length parameter s is a non-zero constant) the curve dynamics can be simplified substantially. L e m m a 2.1. For a linear curvature curve, any ~i, i > 4 can be expressed as a function of only ~1,~2, and ~3. Especially, ~4 iS given as: c(~+~])3/a+3~] ~4 = a:+~i See [6] for a detailed proof. The dynamics of the image of a linear curvature curve is thus simplified to be a three-dimensional system: 43
=
f~w + f32v
(2.8)
where f~ E ]Rs and f3 E IRs are:
(2.9) \
elSr + 3 2 3sr
and ~a is given in the above lemma.
3. C o n t r o l l a b i l i t y
Issues
We are interested in being able to control the shape of the image curves. In the unicycle case, this is equivalent to control the systems (2.6) or (2.8) for general curves or linear curvature curves, respectively. Using the homography between the image plane and the ground plane the controllability could be studied on the ground plane alone. However we have chosen to use vision as a image based servoing sensor in the control loop. Studying the ground plane curve dynamics alone does not give the sort of explicit control laws that we will obtain. Our task is to track the given ground curve F. Note that ~ is still a function of y besides t. It needs to be evaluated at a fixed y. According to Figure 2.2 and Figure 2.3, when the mobile robot is perfectly tracking the given curve F, i.e., the wheel keeps touching the curve, the orthographic image curve should satisfy:
Vision Guided Navigation for a Nonholonomic Mobile Robot
7~(y,t)ly=-dcoso =_o
139 (3.1)
o~.oy (~,0 Jy=-dcos ~ -- 0
Thus, if ~ is evaluated at y = - d c o s r the task of tracking F becomes the problem of steering both ~1 and ~2 to 0. For this reason, from now on, we always evaluate ~ at y -- - d c o s r unless otherwise stated. T h e o r e m 3.1. ( L i n e a r C u r v a t u r e C u r v e C o n t r o l l a b i l i t y ) Consider the system (2.8). If r ~t O, and y = - d c o s r then the distribution As spanned by the Lie algebra s f3) is of rank 3 when the linear curvature c 7t O, and
is of rank 2 when c = O. The proof of this theorem is by directly calculating the controllability Lie algebra for system (2.8) (see [6] for details). According to Chow's Theorem [8], the local reachable space of system (2.8) of ~3 is of 3 dimensions. Actually, for a general curve which is not necessarily of linear curvature, one still can show that the shape of the image curve is controllable only up to its linear
curvature parameters
~3:
The locally reachable space of ~ under the motion of an arbitrary ground-based mobile robot has at most 3 dimensions.
T h e o r e m 3.2. ( G e n e r a l C u r v e C o n t r o l l a b i l i t y )
Similar results can be obtained for the model of a front wheel drive car as shown in Figure 3.1. The kinematics of the front wheel drive car (relative s" s"
Fig. 3.1. Front wheel drive car with a camera mounted above the center O. to the spatial frame) is given by = sin 9ul = cos/~Ul /~ = 1-1 tan a u l ~--u2
(3.2)
140
Yi Ma, Jana Kovseck~, and Shankar Sastry
Comparing (3.2) to the kinematics of the unicycle, we have: w ----l-1 tan c~ul, v -- Ul. From the system (2.6), the dynamics of the image of a ground curve under the motion of a front wheel drive car is given by
0
---- 1-1 t a n ~ f l -t- f2
) ul + (1) u2
---- /1~1 -~- L U 2 .
(3.3)
By calculating the controllability Lie algebra for this system, one can show that the controllability for the front wheel drive car is the same as the unicycle. As a corollary to Theorem 3.1 and 3.2, we have
Corollary 3.1. For a linear curvature curve, the rank of the distribution spanned by the Lie algebra generated by the vector fields associated with the system (3.3) is exactly 4. For constant curvature curves, i.e., straight lines or circles, the rank is exactly 3. For general curves, the image curves are controllable only up to its linear curvature terms. Comments The model of the front wheel drive car has the same inputs and same kinematics as the bicycle model typically used in driving applications which require dynamic considerations [5]. In the dynamic setting the bicycle model lateral and longitudinal dynamics are typically decoupled in order to o b t a i n two simpler models. The lateral dynamics model used for design of the steering control laws captures the system dynamics in terms of lateral and yaw accelerations. The control laws derived using this kinematic model are directly applicable to the highway driving scenarios under normal operating conditions when the dynamics effects are not so dominant.
4. Control
Design
in the
Image
Plane
We already know that in the linear curvature curve case, for unicycle, the dynamics of the image is described by system (2.8), which is a two-input three-state controllable system. According to Murray and Sastry [8], such a system can be transformed to the canonical chained-form. Similarly, in the linear curvature curve case, one can show that, for the car-like model, the image dynamics (3.3) is also convertible to chained-form [6]. For chained-form systems, one can arbitrarily steer the system from one point to another using piecewise smooth sinusoidal inputs [8]. T h a t is, locally one can arbitrarily control the shape of the image of a linear curvature curve.
4.1 Tracking Ground Curves Although one cannot fully control the shape of the image of an arbitrary curve, it is possible for the mobile robot to track it. When the robot is perfectly tracking the given curve, i.e., ~1 = ~2 -- 0, from (2.8) we have: ~2 ~- --~3 v sin r + w / s i n r -- 0.
This gives the perfect tracking angular velocity: w = (3 sin 2 Cv.
(4.1)
Vision Guided Navigation for a Nonholonomic Mobile Robot
141
T h e o r e m 4.1. ( T r a c k i n g C o n t r o l L a w s ) Consider closing the loop of system (2.6) with control (w,v) given by: ~v : ~382r + s2r + Kwh2 v = vo + s2r (~1 + ~3)Vo - Kv~2sign(~l + ~3)
(4.2)
where K~,, Kv are strictly positive constants. The closed-loop system asymptotically converges to the subset: M = {~ E R ~176: ~1 = ~2 = 0} for initial conditions with ~1 and ~2 small enough. Once on M , the mobile robot has the given linear velocity vo and the perfect tracking angular velocity w0 = ~3 sin 2 CVo. For the proof of this theorem see [6]. Notice that the control law only relies on the linear curvature parameters. This observation later helps us to design simplified observer for the system. One may also notices t h a t the control law is not a smooth one. However, when the maximum curvature of the curve is bounded, Kv can be zero and the control law becomes smooth (see the following Corollary 4.1). Although Theorem 4.1 only guarantees local stability, it can be shown by simulation that, with appropriately chosen K . and K~, the tracking control law (4.2) has a very large domain of attraction.
Corollary 4.1. (Tracking C¹-Smooth¹ Piecewise Analytic Curves) Consider an arbitrary C¹-smooth piecewise analytic curve. If its maximum curvature k_max is bounded, then, when K_ω > 0 and K_v ≥ 0, the feedback control law given in (4.2) guarantees that the mobile robot locally asymptotically tracks the given curve.

Corollary 4.1 suggests that, for tracking an arbitrary continuous curve (not necessarily analytic), one may approximate it by a C¹-smooth piecewise analytic curve, a virtual curve, and then track this approximating virtual curve using the control law (4.2). For more details and illustrative examples see [6]. The simulation result of tracking a linear curvature curve (k'(s) = -0.05) is given in Figure 4.1. Here, we choose φ = π/3, K_ω = 1, K_v = 0.5, and v₀ = 1. The initial position of the mobile robot is z_f0 = 0, x_f0 = 0 and
θ₀ = 0.
5. Observability and Estimation Issues
Suppose, at each instant t, the camera provides N measurements of the image curve: I = {(γ_x(y_k, t), y_k) : k = 1, …, N}, where {y₁, y₂, …, y_N} are fixed distances from the origin. Since it is not accurate at all to directly use difference formulas to estimate ξ₂, ξ₃ from noisy ξ₁, it is appealing to dynamically estimate all the ξ_i from I. Using only the measurement ξ₁ = γ_x(y, t) as the output of the vision sensor, for general curves the sensor model is:
¹ C¹-smooth means that the tangent vector along the whole curve is continuous.
Fig. 4.1. Subplot 1: the trajectory of the mobile robot in the spatial frame; Subplot 2: the image curve parameters ξ₁ and ξ₂; Subplots 3 and 4: the control inputs v and ω.
$$\dot\xi = f_1\,\omega + f_2\, v, \qquad h(\xi) = \xi_1 \qquad (5.1)$$
Theorem 5.1. (Observability of the Camera System) Consider the system given by (5.1). If φ ≠ 0, then the annihilator Q of the smallest codistribution Ω invariant under f₁, f₂ and which contains dh is empty.

The proof is by directly calculating the codistribution Ω (see [6]). According to nonlinear system observability theory [4], the system (5.1) is observable. Ideally, one then can estimate the state ξ from the output h(ξ). However, the observer construction may be difficult for such an infinite-dimensional system. Note, according to Theorem 4.1, that one only needs the linear curvature parameters, i.e., ξ³ = (ξ₁, ξ₂, ξ₃), to track any analytic curve. All the higher-order terms ξ_i, i ≥ 4, are not necessary. This suggests using the linear curvature curve dynamics (2.8) to build an applicable observer. Since we do not assume any a priori knowledge of the linear curvature c = k'(s), it also needs to be estimated. For linear curvature curves the simplified sensor model is:
$$\dot\xi^3 = f_1^3\,\omega + f_2^3\, v, \qquad \dot c = 0, \qquad h(\xi^3, c) = \xi_1 \qquad (5.2)$$

Theorem 5.2. (Observability of the Simplified Sensor Model) Consider the system (5.2). If φ ≠ 0, then the smallest codistribution Ω invariant under f₁³, f₂³ and which contains dh has constant rank 4. The proof is similar to that of the general case (see [6]).
The simplified sensor model (5.2) is a nonlinear observable system. We here use the widely applied extended Kalman filter (EKF) [7] to estimate the states of such systems. In order to make the EKF converge faster, we need to use multiple measurements instead of only one. An EKF is designed (for the detailed algorithm see [6]) to estimate the states ξ³ and c of the following stochastic system:
$$\dot\xi^3 = f_1^3\,\omega + f_2^3\,v, \qquad \dot c = \mu_c, \qquad h_k(\xi^3, c) = \xi_1(y_k) + \mu_{h_k}, \quad k = 1, \dots, N$$
where μ_c and μ_{h_k} are white noises with appropriate variances. Simulation results (see [6]) show that the designed EKF converges faster when using more measurements.
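For concreteness, a minimal discretized EKF of the kind described above is sketched below. The propagation and measurement maps f_step and h_meas are placeholders (the fields f₁³, f₂³ and the exact measurement geometry are not reproduced here), the noise covariances Q, R are assumed values, and the Jacobians are taken numerically.

import numpy as np

# Hedged EKF sketch for estimating (xi^3, c) from N stacked image samples, as
# described above.  f_step and h_meas are user-supplied placeholders for the
# discretized dynamics and the stacked measurement map; Q and R are assumed
# process and measurement noise covariances.
def num_jacobian(fun, x, eps=1e-6):
    fx = np.atleast_1d(fun(x))
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (np.atleast_1d(fun(x + dx)) - fx) / eps
    return J

def ekf_step(x, P, z, f_step, h_meas, Q, R):
    # prediction
    F = num_jacobian(f_step, x)
    x_pred = f_step(x)
    P_pred = F @ P @ F.T + Q
    # correction with the stacked measurement vector z (N image samples)
    H = num_jacobian(h_meas, x_pred)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - h_meas(x_pred))
    P_new = (np.eye(x.size) - K @ H) @ P_pred
    return x_new, P_new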
6. Closed-Loop System
We have so far separately developed control and estimation schemes for a mobile robot to track given curves using vision sensors. Combining the control and estimation schemes together, we thus obtain a closed-loop vision-guided navigation system. Simulation results show that the tracking control and the estimation schemes work well with each other in the closed-loop system. For illustration, Figure 6.1 presents the simulation results for tracking a circle.
7. Discussion and Future Work
Based on the understanding of the image curve dynamics, we proposed control laws for tracking an arbitrary curve using quantities measurable in the image. More generally, our study indicates that the shape of the image curve is controllable only up to its linear curvature terms (in the ground-based mobile robot case). The proposed tracking control law is likely just one of many possible solutions, and we would not be surprised if others are found in practice; but the existence of such a stabilizing control law is of considerable theoretical importance. Further, the fact that the control law depends only on the linear curvature terms again shows the important role that linear curvature curves play in the navigation problem. Although visual servoing for ground-based mobile robot navigation has been extensively studied, its application to aerial robots has not received much attention. In the aerial robot case, the motions are 3-dimensional rigid body motions in SE(3) instead of SE(2) for ground-based mobile robots, whence one loses the fixed homography between the image plane and the ground plane. A study of the 3-dimensional case is in progress. It is an important topic for applications in autonomous helicopter or aircraft navigation.
Fig. 6.1. Simulation results for the closed-loop system. In Subplot 7 the solid curve is the actual mobile robot trajectory and the dashed one is the nominal trajectory. Subplot 8 is an image of the nominal trajectory viewed from the camera.
Acknowledgement. This work was supported by ARO under the MURI grant DAAH04-96-1-0341. We would like to thank Dr. Stefano Soatto for the formulation of this problem given in the AI/Robotics/Vision seminar at UC Berkeley, October 1996.
References
1. E. D. Dickmanns and V. Graefe. Applications of dynamic monocular machine vision. Machine Vision and Applications, 1(4):241-261, 1988.
2. B. Espiau, F. Chaumette, and P. Rives. A new approach to visual servoing in robotics. IEEE Transactions on Robotics and Automation, 8(3):313-326, June 1992.
3. R. Frezza and G. Picci. On line path following by recursive spline updating. In Proceedings of the 34th IEEE Conference on Decision and Control, volume 4, pages 4047-4052, 1995.
4. Alberto Isidori. Nonlinear Control Systems. Communications and Control Engineering Series. Springer-Verlag, second edition, 1989.
5. J. Košecká, R. Blasi, C. J. Taylor, and J. Malik. Vision-based lateral control of vehicles. In Proc. Intelligent Transportation Systems Conference, Boston, 1997.
6. Yi Ma, Jana Košecká, and Shankar Sastry. Vision guided navigation for a nonholonomic mobile robot. Electronic Research Laboratory Memorandum, UC Berkeley, UCB/ERL(M97/42), June 1997.
7. Jerry M. Mendel. Lessons in Digital Estimation Theory. Prentice-Hall Signal Processing Series. Prentice-Hall, first edition, 1987.
8. Richard M. Murray, Zexiang Li, and Shankar S. Sastry. A Mathematical Introduction to Robotic Manipulation. CRC Press Inc., 1994.
9. D. Raviv and M. Herman. A "non-reconstruction" approach for road following. In Proceedings of the SPIE, Intelligent Robots and Computer Vision, volume 1608, pages 2-12, 1992.
10. J. Weber, D. Koller, Q. T. Luong, and J. Malik. An integrated stereo-based approach to automatic vehicle guidance. In Proceedings of IEEE International Conference on Computer Vision, pages 52-57, June 1995.
Design, Delay and Performance in Gaze Control: Engineering and Biological Approaches Peter Corke CSIRO Manufacturing Science and Technology PO Box 883 Kenmore, Australia, 4069.
Summary. In this paper published control models for robotic and biological gaze control systems are reviewed with an emphasis on dynamic characteristics and performance. The earlier work of Brown [2] is extended, taking into account more recent neurophysiological models and high-performance visual servoing results. All the models are compared using a common notation and diagrammatic framework which clearly shows the essential commonalities and differences between approaches.
1. Introduction

High-performance visual tracking, or gaze control, systems have been developed by evolution in the primate oculomotor system and, much more recently, by the robotics and active vision research communities. Robotic visual servoing is a maturing control paradigm [7] and ongoing technological development now makes it feasible to visually servo a robot at video frame rate using a standard desktop computer [7]. Neurophysiologists and neuroscientists have treated biological tracking and fixation responses as classic black box systems and endeavour, using input-output data, to propose models of those systems that are consistent with known retinal and neural physiology -- in effect, reverse engineering. The robot visual servoing community has, until recently, concentrated largely on what can be called the kinematics of visual control, and system dynamics have tended to be ignored [5]. The relatively poor dynamic performance of reported systems, long settling times and tracking lag, indicates that stability is achieved by detuning, not design [4]. Neurophysiologists on the other hand have been hypothesizing dynamic models since the 1960s. They use classical control systems terminology and tools such as time domain plots, Bode diagrams, block diagrams and so on. The problems they face are that the systems they study are extremely complex and the technology (neural structure) is only partially understood. Nonetheless, by means of careful and ingenious experimentation, models have been developed which have the ability to predict aspects of visual behaviour. Although the models proposed for human visual tracking are interesting and sufficient for their task, they are not necessarily the best tracking system
possible. The tracking performance of the eye exhibits steady-state velocity error and significant oscillation; see Figure 1.2. The design of the biological control system is also strongly influenced by the 'technology' available for 'implementation'. The remainder of this section will introduce issues common to biological and engineered systems such as control structure, delay and performance measures. Section 2 will discuss biological systems: a brief background and then a review of some proposed models. Section 3 will cover engineered visual servoing systems, and finally, conclusions are presented in Section 4.

Fig. 1.1. Block diagram of the generic gaze control system.

1.1 Control structure
Anatomically, fixation or pursuit is configured as a negative feedback control system since the moving retina directly senses tracking error. In this fashion the system structure is similar to the robotic eye-in-hand visual servo system [7]. In order to compare the essential differences between the various approaches (designed and evolved) it is convenient to use the standard block diagram shown in Figure 1.1, which represents a feedback system for one-dimensional gaze control. The common feature of this model, and all the variants that follow, is negative feedback due to the leftmost summing junction, which models the fact that the retina directly senses tracking error. T represents the target position (actually bearing angle, since the eye tracks by rotating), and E is the eye position (again, an angle). Some models use velocity rather than position inputs and outputs, that is, Ṫ and Ė. We will use Δ(t) to denote a pure time delay of t milliseconds. In this discussion we will completely ignore the recognition and spatial processing aspects of visual perception; only the temporal characteristic is considered. Thus visual perception is modeled as a continuous time transfer function V(s) between actual tracking error and the retinal output. M(s) is the transfer function of the eye's motor, between demand signal and gaze angle, and D(s) is the transfer function of a forward path compensator designed (or evolved) to stabilize the closed-loop system.
1.2 The curse of delay

The dynamic characteristic that causes the problem for robotic and biological systems is delay, which has a destabilizing effect on feedback control systems. Delay has been a key component of neurophysiological models since the 1960s, but was first noted in the visual servoing context nearly a decade later [5]. Sharkey et al. [12] describe six types of delay that can occur in a vision controlled system: controller, actuator, sensor, transport (sensor to action), communications (sensor to control), and communications protocol variability. Within the vision system itself delay is due to [5] factors such as serial transport of pixels from camera to framestore, finite exposure time of the sensor, and execution time of the feature extraction algorithm. Human retinal processing delay of 50 ms [10] is of a similar order to a video frame time of 40 ms. The deleterious effect of delay can be explained by considering the block diagram structure of Figure 1.1 with V(s) comprising a pure time delay of τ, that is, V(s) = e^{-sτ}. This leads to the closed-loop transfer function
$$\frac{E(s)}{T(s)} = \frac{D(s)M(s)e^{-s\tau}}{1 + D(s)M(s)e^{-s\tau}} \qquad (1.1)$$
The time delay term can be substituted by its Padé approximation
$$e^{-s\tau} \approx \frac{1 - s\tau/2}{1 + s\tau/2} \qquad (1.2)$$
and setting the resulting characteristic equation to zero allows us to solve for the maximum loop gain that maintains stability. On the root locus diagram we can consider that one of the closed-loop poles moves toward the right-half-plane zero of (1.2). For discrete-time systems z-plane techniques can be used [5].
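As an illustration of this calculation (an assumed first-order plant, not a model from the text), the sketch below forms the characteristic polynomial obtained with the Padé substitution (1.2) and sweeps the loop gain to find the value at which a closed-loop root crosses into the right-half plane.

import numpy as np

# Hedged sketch: maximum stable loop gain for D(s)=K, M(s)=1/(sT+1) and a
# visual delay V(s)=exp(-s*tau) replaced by its first-order Pade
# approximation (1.2).  T and tau are illustrative values only.
T, tau = 0.1, 0.04                      # plant time constant and delay (s)

def unstable(K):
    # characteristic equation 1 + K*(1/(sT+1))*((1-s*tau/2)/(1+s*tau/2)) = 0,
    # i.e. (sT+1)(1+s*tau/2) + K(1-s*tau/2) = 0
    coeffs = [T * tau / 2, T + tau / 2 - K * tau / 2, 1 + K]
    return np.max(np.roots(coeffs).real) > 0

gains = np.linspace(0.1, 50, 500)
K_max = next(K for K in gains if unstable(K))
print("approximate maximum loop gain:", K_max)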
Fig. 1.2. Primate (Rhesus monkey) response to step velocity motion[8].
1.3 Control performance

An example of the performance of primate gaze control is shown in Figure 1.2. The achievable performance of a robotic gaze control system is related to the sampling rate, which is usually limited by the vision system. If we assume a vision system operating at 60 samples per second (as is common), then a common rule of thumb would lead to an expected closed-loop bandwidth of between one fifth and one tenth of that frequency, which is between 6 and 12 Hz. In time domain terms this would equate to a rise time of between 60 and 120 ms. If the system dynamics were simple then most laboratory experiments should easily achieve this level of performance, even through ad hoc tuning. That they do not indicates that system dynamics are not well understood and the control design is sub-optimal. Any comparison of control performance is of little use without a quantitative performance measure. Since the task, pursuit or fixation, is defined in the image plane, image plane error seems an appropriate measure of performance. For a particular target motion the error could be described in terms of peak, steady-state, RMS or other characteristics. It is extremely important to realize that good performance in one respect does not necessarily equate to good tracking of a moving object. For visual servoing systems step response and settling error are commonly given but are actually the least useful performance measures for a tracking system. The reason for this is that most closed-loop visual servo systems are by structure of Type I. These systems will, by definition, exhibit zero steady-state error to a step change in target position, but finite error to a ramp or higher-order demand. The Type I characteristic is due to the inclusion of an integrator, implicit in a velocity controlled axis, or explicit in series with a position controlled axis. To measure tracking performance a more challenging trajectory such as a ramp or sinusoid is appropriate [5]. Neurophysiologists use a Rashbass step-ramp stimulus, see Figure 1.2 lower right, to investigate smooth pursuit motion. This test has a random choice of direction and starting time in order to confound the biological subject's ability to predict the motion.
2. Biological gaze control
2.1 Background

Interestingly, the performance and structure of gaze control is species specific and is most advanced in species with foveated vision. Rabbits for example have no foveal region, cats a poorly developed one, while primates have the most highly developed fovea. The human fovea has a field of view of around 1° and a cone photoreceptor density that is 20 times greater than that in the periphery.
The evolutionary advantage of this localized high visual acuity region has necessitated the development of high performance oculomotor control [3]. Clear vision over the entire field of view is achieved by subconscious eye motion that accurately and rapidly directs the fovea to regions of interest. Gaze accuracy varies from 3° if the head is stabilized, to 15° when sitting, to 30° with natural head movements. The control degree of freedom for the eye is rotation within the orbits. For translational motion, as occurs during locomotion, rotation alone cannot stabilize an image of a scene containing objects at varying distances. Some animals, in particular birds, are able to make compensatory linear head movements, while primates use gaze control. Mechanically the human eye is capable of extremely high performance motion, and the muscles that actuate the human eye are the fastest acting in the body [13]. The eyeball has low inertia and a low friction 'mounting', and is able to rotate at up to 600 deg/s and 35,000 deg/s² for saccadic motion. Only a small number of robotic actuators have been able to achieve this level of performance [13]. The eye has three degrees of rotational motion, but is capable of only limited rotation about the viewing axis (cyclotorsion). Primate gaze control has two modes: saccadic motions to move rapidly to new targets, and fixation to maintain gaze on the target of interest. Here we consider only the issues involved in maintaining gaze on a stationary or moving target. The issues of saccadic motion, gaze planning, and the use of controlled gaze are large topics in their own right which will not be discussed further. Three reflexes, or control circuits, contribute to gaze stability:
- optokinetic reflexes (OKR) comprise optokinetic nystagmus (OKN), a sawtooth-like eye motion used to stabilize a moving visual field with periodic resets of the eye's position. There are two modes: delayed (also called slow or indirect), which builds up over several seconds and which persists after the stimulation is removed, and early (also called rapid or direct), which leads to the initial rapid rise in OKN.
- smooth pursuit, where the eye tracks a particular target even in the presence of opposing background motion.
- vestibulo-ocular reflex (VOR), a powerful reflex that links balance to vision. Head motion information from inertial sensors for rotation (semicircular canals) and translation (otoliths) commands compensatory eye motions via a coordinate transformation.
- feedforward of proprioceptive (measured) signals from neck muscles which control the pose of the head with respect to the body.

In humans the early OKN dominates and there is controversy about whether or not this is the same mechanism as smooth pursuit (SP). The human smooth pursuit system is able to match target velocity at up to 15 deg/s with no error, and with increasing error up to a maximum eye velocity of 40 deg/s. Experiments reveal a delay of 130 ms between the onset of target and eye motion. This delay is partitioned as 50 ms for the retinal and neural
system and 80 ms for peripheral (muscle) delay. There is physiological evidence that the oculomotor control system can be considered a continuous time system. It seems to be generally agreed that the principal input to the "controller" is retinal slip. There is also evidence [14] that the position of the target on the retina is important, since a pure velocity servo could track perfectly but retain a constant position offset. Goldreich et al. [6] suggest that image acceleration may also be computed and used for gaze control.

2.2 Models of biological gaze control

The early model by Young [14] started from the observation that biological gaze control systems are stable and have high performance despite the presence of feedback, delay and the high gain required for accurate tracking. A model was proposed [11] in which the negative feedback was cancelled by a positive feedback path, see Figure 2.1. In the event of perfect cancellation by the introduced term the closed-loop dynamics are simply D(s)M(s) -- the open-loop motor dynamics with a series precompensator. Any errors in the parameters of the introduced term, V(s) or M(s), will result in imperfect cancellation and lower performance motion. Eliminating negative feedback also eliminates its benefits, particularly robustness to parameter variations. In a biological system these variations may be caused by injury, disease or aging. Robinson proposes that parameter adaptation occurs, modeled by the gain terms P1 and P2, and provides experimental evidence to support this. Such 'plasticity' in neural circuits is common to much motor learning and involves change over time scales measured in days or even weeks.
Fig. 2.1. The Robinson model [10]. V = Δ(50), M = Δ(30)/(sT₂ + 1), D = P₁Δ(50)/(sT + 1).

The effect of the positive feedback is to create an estimate of the target velocity based on measured retinal velocity and delayed eye velocity command. The feedforward controller of Figure 3.5 is very similar except that target
position is estimated from 'retinal' and motor position information and then differentiated to form the principal component of the motor velocity demand. A significant limitation of this model is that it does not predict the oscillations which are observed experimentally, see Figure 1.2, though some oscillation can be induced by imperfect cancellation in the positive feedback loop. Another, more complex, model proposed by Robinson [10] retains the positive feedback loop, but includes forward path non-linearities and an internal feedback loop. With suitable tuning of the many parameters a good fit with experimental data was obtained.
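To make the cancellation argument concrete, the following sketch compares the ordinary negative-feedback response DMV/(1 + DMV) with the open cascade DM obtained under perfect positive-feedback cancellation, using transfer functions of the form quoted with Figure 2.1; the time constants T1, T2 and gain P1 are illustrative assumptions, not fitted values.

import numpy as np

# Hedged sketch: frequency responses of a Robinson-style loop with and
# without perfect positive-feedback cancellation.  V, M, D follow the forms
# quoted with Fig. 2.1 (pure delays plus first-order lags); T1, T2 and P1
# are assumed values.
T1, T2, P1 = 0.15, 0.02, 0.9
w = np.logspace(-1, 2, 400)              # rad/s
s = 1j * w

V = np.exp(-s * 0.050)                   # 50 ms retinal delay
M = np.exp(-s * 0.030) / (s * T2 + 1)    # oculomotor plant
D = P1 * np.exp(-s * 0.050) / (s * T1 + 1)

closed = D * M * V / (1 + D * M * V)     # ordinary negative-feedback loop
cancelled = D * M                        # perfect cancellation: loop opens up

for name, H in [("negative feedback", closed), ("perfect cancellation", cancelled)]:
    print(name, "peak |E/T| =", np.max(np.abs(H)))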
Fig. 2.2. The full Krauzlis-Lisberger model [8]. V = Δ(60).

An alternative model by Krauzlis and Lisberger [8], shown in Figure 2.2, eliminates the positive feedback loop and relies on forward path compensation. Their full model has a compensator with three parallel paths, each comprising a non-linear function, a second order filter and a gain. By suitable tuning of the parameters a good fit with experimental data can also be obtained. A linearized version is simply a PD controller, that is, D(s) = K_p + sK_d. Goldreich et al. [6] performed an ingenious experiment to determine whether the observed oscillation was due to an internal loop as proposed by Robinson [10], or the "natural" visual feedback loop. The experiment altered the effective retinal delay by using the measured gaze direction to control the position of the target. Increased retinal delay resulted in decreased oscillation frequency, as predicted by the Krauzlis-Lisberger model, whereas the Robinson model predicts no change since the oscillation is a function of an internal loop. The Krauzlis-Lisberger (KL) model has its own shortcomings. In particular it fails to predict how the damping factor varies with increased retinal delay. Analysis of the Goldreich data by Ringach [9] showed that a curve of damping factor versus delay has the shape of an inverted U, while the model predicts a monotonic decrease. Ringach's model, termed "tachometer feedback" and shown in Figure 2.3, is a linear system in which a delayed version of eye velocity rate (acceleration) is fed back. Ringach shows that such a structure is robust with respect to variations in system latency. There is also neurophysiological evidence to support the existence of an eye acceleration
Fig. 2.3. The Ringach "tachometer feedback" model [9]. D = K_p, M = Δ(τ_m)/s, D_M = K_t s Δ(τ_t), and V = Δ(τ_v).
signal. When the inner-loop delay, τ_t, is equal to τ_v the model is equivalent to the linearized KL model. Several researchers [10, 8] discuss the possibility of a neural pursuit switch, since there is experimental evidence of a difference in the observed dynamics when the eye is fixated on moving or non-moving targets. Such a switch is shown in the KL model of Figure 2.2. If the target motion exceeds a threshold of 3.5 deg/s the pursuit switch is closed for 20 ms.
3. Robotic gaze control
The same problems that face biological systems also confront robotic systems. Robots and robotic heads are additionally handicapped by poor dynamic performance, high inertia and significant friction. The classical engineering approach to this tracking problem is to identify the dynamics V(s) and M(s) and synthesize an appropriate controller D(s), see Figure 1.1. Corke [5] examines a number of classical approaches to control design including high gain, increasing system Type, PID, pole-placement control and state feedback. It was found that the closed-loop poles cannot be made arbitrarily fast and are constrained by the practical requirement for compensator stability if the system is to be robust with respect to modeling errors or plant non-linearities. In addition it was found that a fast local axis control loop, position or velocity, is required to achieve acceptable performance given the low visual sampling rate and the non-ideality of a real robot axis. The performance of a simple proportional-only feedback control system is shown in Figure 3.1, based on detailed computational and electro-mechanical models [5]. The steady state tracking error is constant, as expected for a Type I system, at 28 pixels. Several researchers [12, 2, 5] have discussed the use of the Smith predictor, a classic technique for systems incorporating time delays. Consider a discrete-time plant with dynamics [1]
Fig. 3.1. Simulated response of proportional feedback visual servo [5] to Rashbass test of 15 deg/s.
Fig. 3.2. The Smith predictor.
Fig. 3.3. "Natural Smith predictor" of Sharkey et al. [12], with velocity loops eliminated for simplicity of comparison.
$$H(z) = \frac{1}{z^d}\,\frac{B'(z)}{A'(z)} = \frac{1}{z^d}\,H'(z) \qquad (3.1)$$
where Ord(A') = Ord(B') and d is the time delay of the system. A compensator, D'(z), is designed to give the desired performance for the delay-free system H'(z). Smith's control law gives the plant input as
$$U = D'\{Y_d - Y\} - D'H'\{1 - z^{-d}\}U \qquad (3.2)$$
If the plant is partitioned such that all delay is in the vision system, then V(z) = z^{-d} and M(z) = H'(z), and the controller is
$$U = D'\{T - E\} - D'M\{1 - V\}U \qquad (3.3)$$
which is shown diagrammatically in Figure 3.2. This has the same positive feedback structure as Robinson's model, but an additional negative feedback loop. Expressing the compensator of (3.2) in transfer function form,
$$D(z) = \frac{z^d A'D'}{z^d(A' + B'D') - B'D'} \qquad (3.4)$$
reveals that open-loop plant pole cancellation is occurring, which makes the controller non-robust with respect to plant parameter variations. Sharkey et al. [12] report experimental gaze control results obtained with the 'Yorick' head. A predictor, based on accurate knowledge of actuator and vision processing delays, is used to command the position controlled actuators. They suggest that their scheme "naturally emulates the Smith Regulator", and Figure 3.3 shows some, but not complete, similarity to the Smith predictor of Figure 3.2. Brown [2] investigated the application of Smith's predictor to gaze control by means of simulation and thus did not encounter problems due to plant modeling error. His model assumed that all delay was in the actuator, not in the sensor or feedback path.
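A minimal discrete-time sketch of Smith's control law (3.2), written in its equivalent internal-model form, is given below; the delay-free plant H'(z), the delay d and the PI compensator D'(z) are assumed examples rather than models of any particular head, and the internal model is taken to be exact.

import numpy as np

# Hedged discrete-time sketch of Smith's control law (3.2).  The delay-free
# plant H'(z) = b/(z - a), the delay d and the PI compensator D'(z) are
# assumed examples; the internal model matches the plant exactly.
a, b, d = 0.9, 0.1, 4                    # plant pole, gain, delay in samples
Kp, Ki = 2.0, 0.5                        # PI compensator D'(z)

N = 200
r = np.ones(N)                           # step demand
y = np.zeros(N)                          # true plant output (with delay)
ym = np.zeros(N)                         # internal delay-free model output
u = np.zeros(N)
integ = 0.0

for k in range(1, N):
    # true plant: y(k) = a*y(k-1) + b*u(k-1-d)
    y[k] = a * y[k - 1] + b * (u[k - 1 - d] if k - 1 - d >= 0 else 0.0)
    # internal delay-free model driven by the same input
    ym[k] = a * ym[k - 1] + b * u[k - 1]
    # Smith predictor error: demand minus measured output, corrected by the
    # difference between the undelayed and the delayed model outputs
    ym_delayed = ym[k - d] if k - d >= 0 else 0.0
    e = r[k] - y[k] - (ym[k] - ym_delayed)
    integ += e
    u[k] = Kp * e + Ki * integ           # D' acting on the corrected error

print("final tracking error:", r[-1] - y[-1])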
M (DV + DFF) 1 + VMD
(3.5)
D_FF could be selected such that the transfer function becomes unity, and the tracking error would be zero. Such a control strategy is not realizable since it requires (possibly future) knowledge of the target position, which is not directly measurable. However this information may be estimated, as shown in Figure 3.5. The performance of such a feedforward controller is shown in Figure 3.6, based on similar detailed models as in Figure 3.1. The steady state tracking error after 0.7 s is within a few pixels. The initial ringing is due to the particular control and estimator design, which was optimized for tracking sinusoidal target motion.
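The unrealizability of the ideal choice can also be seen directly from (3.5): forcing E/T = 1 requires M·D_FF = 1, i.e. D_FF = 1/M, and when M contains its own delay this inverse is an acausal (predictive) operator. The sketch below checks this numerically for an assumed plant, compensator and delay (illustrative values only, not the book's models).

import numpy as np

# Hedged sketch: with the feedforward structure of (3.5), the ideal choice
# D_FF = 1/M makes E/T identically one.  V, M and D are assumed examples.
w = np.logspace(-1, 2, 200)
s = 1j * w
V = np.exp(-s * 0.040)                   # assumed 40 ms visual delay
M = np.exp(-s * 0.020) / (0.05 * s + 1)  # assumed actuator with its own delay
D = 2.0

E_over_T = lambda Dff: M * (D * V + Dff) / (1 + V * M * D)
Dff_ideal = 1.0 / M                      # from (3.5): E/T = 1  <=>  M*D_FF = 1
print(np.allclose(E_over_T(Dff_ideal), 1.0))
# 1/M = (0.05*s + 1)*exp(+0.020*s) is improper and acausal (it needs future
# target information), which is why the target must be estimated instead
# (Figure 3.5).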
Fig. 3.4. Ideal feedforward system (unrealizable).
Fig. 3.5. Implementable feedforward control model[5].
Fig. 3.6. Simulated response of feedforward visual servo [5] to Rashbass test of 15 deg/s.
4. Conclusion

This paper has compared a number of models of gaze control for biological and engineered systems. The human eye has had to evolve high-performance gaze control to match retinal foveation. The presence of delay within a negative feedback system makes it difficult to achieve high tracking accuracy (necessitating high loop gain) and stability. The human gaze control system is not yet fully understood but appears to have evolved some interesting control strategies which are adequate for the task and have higher performance than simplistic visual servoing systems. However, tracking systems based on standard control engineering principles such as increased system Type, pole placement or feedforward control are able to demonstrate higher levels of performance than the primate smooth pursuit reflex. Target motion reconstruction is a recurring theme in both engineered and evolved gaze control systems.
References
1. K. J. Åström and B. Wittenmark. Computer Controlled Systems: Theory and Design. Prentice Hall, 1984.
2. C. Brown. Gaze controls with interactions and delays. IEEE Trans. Syst. Man Cybern., 20(1):518-527, 1990.
3. H. Collewijn. Integration of adaptive changes of the optokinetic reflex, pursuit and vestibulo-ocular reflex. In A. Berthoz and G. M. Jones, editors, Adaptive Mechanisms in Gaze Control: Facts and Theories, Reviews of Oculomotor Research, chapter 3, pages 51-69. Elsevier, 1985.
4. P. Corke. Dynamic issues in robot visual-servo systems. In G. Giralt and G. Hirzinger, editors, Robotics Research: The Seventh International Symposium, pages 488-498. Springer-Verlag, 1996.
5. P. I. Corke. Visual Control of Robots: High-Performance Visual Servoing. Mechatronics. Research Studies Press (John Wiley), 1996.
6. D. Goldreich, R. Krauzlis, and S. Lisberger. Effect of changing feedback delay on spontaneous oscillations in smooth pursuit eye movements of monkeys. J. Neurophysiology, 67(3):625-638, Mar. 1992.
7. S. Hutchinson, G. Hager, and P. Corke. A tutorial on visual servo control. IEEE Transactions on Robotics and Automation, 12(5):651-670, Oct. 1996.
8. R. Krauzlis and S. Lisberger. A model of visually-guided smooth pursuit eye movements based on behavioural observations. J. Computational Neuroscience, 1:265-283, 1994.
9. D. Ringach. A 'tachometer' feedback model of smooth pursuit eye movements. Biological Cybernetics, 73:561-568, 1995.
10. D. Robinson, J. Gordon, and S. Gordon. A model of the smooth pursuit eye movement system. Biological Cybernetics, 55:43-57, 1986.
11. D. A. Robinson. Why visuomotor systems don't like negative feedback and how they avoid it. In M. A. Arbib and A. R. Hanson, editors, Vision, Brain, and Cooperative Computation. MIT Press, 1988.
12. P. Sharkey and D. Murray. Delays versus performance of visually guided systems. IEEE Proc.-Control Theory Appl., 143(5):436-447, Sept. 1996.
13. A. Wavering, J. Fiala, K. Roberts, and R. Lumia. Triclops: A high-performance trinocular active vision system. In Proc. IEEE Int. Conf. Robotics and Automation, pages 410-417, 1993.
14. L. Young. Pursuit eye tracking movements. In P. Bach-Y-Rita and C. Collins, editors, The Control of Eye Movements, pages 429-443. Academic Press, 1971.
The Separation of Photometry and Geometry Via Active Vision Ruzena Bajcsy and Max Mintz GRASP Laboratory University of Pennsylvania Philadelphia, PA 19104 USA
1. Introduction

In this paper we propose an active-vision framework for studying the utility (costs and benefits) of using photometric information to obtain enhanced 3-D scene reconstructions based on polynocular stereo. One of our basic tenets in this work is the principle that improved information about the pixel-level data yields improved accuracy and reliability in the interpretation of the image. Thus, we are interested in exploiting knowledge of an object's reflectance properties to obtain a better estimate of its geometry. The reflectance properties of an object can have an important impact on the recovery of its geometry. One of our goals is to characterize the tradeoff between recovery accuracy and intrinsic signal processing costs. Since most interesting computer vision problems are intrinsically ill-posed (under-determined) inverse problems, we should strive to obtain more and better data to reduce these uncertainties. In addition to the recovery of scene geometry, other vision problems of interest to us include the estimation of kinematic models and the inference of material properties from optical data. These tasks generally require a combination of spatial, spectral, and temporal image data. In order to separate photometry from geometry and accurately recover the scene geometry, we propose a three-phase multi-layer process of refinements:

Phase 0:
a) Using an active mobile camera system, estimate the positions of the point sources of the scene illumination.

Phase 1:
a) Without recourse to specific reflectance information, estimate object surface normals using the polynocular stereo module.
b) Within the given viewpoints, localize the surface patches which exhibit strong highlights. There are several methods to accomplish this task, including tracking the motion of the highlights by making small movements of the camera system.
Phase 2:
a) Estimate the photometric invariant properties at each pixel.
b) Re-evaluate the geometric reconstruction with this additional photometric information.
We continue this process by applying a modified Phase 1 - Phase 2 cycle, where Phase 1 becomes:

Phase 1':
a) Using the information obtained in the previous application of Phase 2, estimate object surface normals using the polynocular stereo module.
b) Within the given viewpoints, localize the surface patches which exhibit strong highlights.

We continue this process for a specified number of cycles or until a consistent result is obtained.
2. Background

2.1 Photometric Measurements in Computer Vision

During the past 25 years most of the algorithmic work in computer vision emphasized scene geometry. The working hypothesis has been that the most important information in the image is contained in the intensity discontinuities in the signal, i.e., the edges in the image. Edges are the basis for most of the contour and/or boundary detections in the scene, which are then interpreted as boundaries of objects. Similarly, edges are the basis for stereo-based depth reconstruction and optical-flow computation. The standard assumption is that: (i) the illumination is either a single or diffuse source; and (ii) the surfaces in the scene are Lambertian, i.e., the reflected light intensity is independent of viewing angle. The standard assumption is often violated because: (i) the scene illumination may be due to multiple point sources and/or multiple diffuse sources; (ii) there may be shadows due to object configurations; (iii) the surfaces of objects in the scene may not be Lambertian, e.g., there may be significant highlights due to specular reflections; and (iv) there may be inter-reflections between objects in the scene. These violations can produce significant errors in image understanding. There have been exceptions to this geometry-based approach. In 1975, Horn, in his AI Memo 335, Image Intensity Understanding, and in The Facts of Light [Horn 1975], studied the problem of image formation. This work was continued by his students: Woodham, Sjoberg, and Ikeuchi. They obtained results on shape from shading, with various applications, e.g., machine inspection, remote sensing, and general digital photogrammetry. In 1985, Shafer [Shafer
1985] pursued the analysis of color, illumination, and reflectance phenomena and showed how one can use color to separate reflection components. This work inspired Lee [Lee 1992] to pursue further research on the separation of reflection components. Psychologists have had a long-standing interest in this problem, going back to Helmholtz [Helmholtz 1910]. We cite Richards' work on Psychometrical Numerology [Richard 1967] and, subsequently, Lightness Scale from Image Intensity Distributions, AI Memo 648 [Richard 1981], since it comes close to our thinking in developing probability distributions of image intensities of classes of materials occurring in nature (e.g., leaves, trees, and grass). Recently, Adelson and Pentland [Adelson and Pentland 1990] have likened the visual system to a three-person workshop crew that produces theatrical sets. One person is a painter, one is an illumination expert, and one bends metal. Any luminance pattern can be produced by any of the three specialists. The painter can paint the pattern, the lighting expert can produce the pattern with variations in the illumination, and the metalworker can create the pattern by shaping the surface. This idea that the retinal image can be parsed into a set of overlapping layers goes back in psychology to 1977, in the work of Bergstrom and Gilchrist [Bergstrom and Gilchrist 1977], and later Meteli [Meteli 1985]. Recently Gilchrist [Gilchrist 1997] developed an alternative theory of lightness based on gestalt principles. However, from a machine perception point of view, it is still important to understand the computational process of this parsing/decomposition. Physicists have had a long-standing interest in this problem and one model that is often cited is by Kubelka and Munk [Kubelka and Munk 1931]. A very influential work which developed a theoretical formulation of reflectivity that includes surface granularity and texture is due to Torrance in his PhD dissertation, later published with Sparrow in [Torrance and Sparrow 1967]. The concept of the bidirectional reflectance distribution function (BRDF) that is now standard for assessing the reflectance characteristics of a material from optical measurements is due to Nicodemus et al. [Nicodemus et al. 1977]. A great deal of work on understanding the interaction between illumination, surfaces, and the observer has been done in the area of computer graphics, notably by the group at Cornell University led by Professor D. Greenberg and Professor K. E. Torrance. The work by He et al. [He 1991] improved the Torrance-Sparrow model with a more detailed analysis of inter-reflections versus global illumination and surface roughness; see also Arvo et al. [Arvo 1994]. Here, the researchers have asked the inverse question, that is: how must one model the illumination effects so that the generated images look realistic?

2.2 Reflectance Phenomena

A photometric feature is constructed from image irradiance represented as:
$$I(x, \lambda) = g(x)\, e(x, \lambda)\, s(x, \lambda) \qquad (2.1)$$
where g(x) is the geometric factor, e(x, λ) is the illumination, s(x, λ) is the diffuse surface reflectance of the object, and λ represents the wavelength of the light. It is the object reflectance s(x, λ) from which useful object features are obtained. For diffuse reflections, the geometric factor for shading is given as g(x) = n_s · n(x), where n_s is the illumination direction and n(x) is the normal of the surface projected to x = (x, y). The image irradiance I(x, λ) is influenced by the object pose with respect to the illumination (g(x)) and by the illumination intensity and color (e(x, λ)). We are looking for measures that are invariant to geometric pose and at least semi-invariant with respect to illumination and environmental conditions. We follow here the derivation of Lee [Lee 1992]. Since the image irradiance given in Equation 2.1 includes confounded effects of geometry, illumination and surface reflectance, we take the logarithm of the image irradiance to separate the multiplicative terms into additive terms. The logarithmic irradiance is given as:
$$\mathcal{L}(x, \lambda) = \ln I(x, \lambda) = \ln g(x) + \ln e(x, \lambda) + \ln s(x, \lambda). \qquad (2.2)$$
The key to our approach is to investigate the use of the gradients of L in the λ direction as illumination-pose- and color-invariant signatures. Since g(x) is independent of λ, the effect of illumination-pose variation is eliminated in L_λ as:
$$\mathcal{L}_\lambda(x, \lambda) = \frac{\partial \mathcal{L}(x, \lambda)}{\partial \lambda} = \frac{e_\lambda(x, \lambda)}{e(x, \lambda)} + \frac{s_\lambda(x, \lambda)}{s(x, \lambda)}. \qquad (2.3)$$
As the result of λ-differentiation, L_λ consists only of the normalized e_λ and s_λ, i.e., the chromaticity gradients of illumination and reflectance, respectively. This means that since g(x) is removed, L_λ is independent of the shading change due to illumination pose differences. However, we are still left with the illumination function and the object reflectance function. The illumination function is composed of the primary and secondary illumination functions, the latter coming from inter-reflections. If the illumination function can be restricted to the primary source, and e(x, λ) changes very slowly with λ, then its partial derivative with respect to λ is approximately zero, and this leaves us with only the terms related to object reflectance. This is all under the assumption that the surface is Lambertian. So the issue here is: how complex is the function s(x, λ), and hence, how can it be approximated? This approximation question translates into: how many spectral filters must we have in order to be able to recover the photometric invariant features of the surface patch? This, of course, will depend upon the complexity of the environment, i.e., the number of photometrically distinct surface patches in the scene. The modification of the reflectance term by the illumination term e_λ/e is only additive. The collection of λ-gradients at spectral locations λ_k, k = 1, 2, ..., L forms an L-dimensional feature vector and it is invariant to illumination color up to the bias generated by the normalized illumination gradient e_λ(x, λ)/e(x, λ). The most notable
disadvantage of using the spectral gradients is that the object color signature can be eroded when the illumination color varies over the scene. For the purpose of this study, we must obtain a catalogue of classes of materials that commonly occur in the operating environment. For each material listed, we have the corresponding BRDF as a function of wavelength. This spectral information will indicate the number of filters which are required to implement the previous computations for the photometric invariants. This will, in turn, provide us with images that can be segmented in a natural way by the class of expected surfaces/materials in the environment. Since the secondary illumination is a result of mutual reflection of nearby surfaces, the working hypothesis is that the combination of these filtered images will also separate the secondary illumination and hence reveal the body reflection via the photometric differential. Our preliminary observation is that most surfaces are only partially Lambertian with respect to the angle of illumination and their surface normals. Hence, there is a need to have some estimate of the surface normals. As stated above, we can use our polynocular stereo module to obtain approximate surface normals. Alternatively, we may also be able to use a differential stereo system due to Farid and Simoncelli [Farid and Simoncelli 1997] to determine the illumination and observation angles under which we can estimate the photometric invariants. The next step in this process is the determination of a set of spatial derivatives. These derivatives can be used in constructing local shape descriptors which are invariant to viewing pose. We note that until recently, illumination, color, and pose invariance have not received significant attention. The spatial derivative of L in the x direction is:
$$\mathcal{L}_x(x, \lambda) = \frac{g_x(x)}{g(x)} + \frac{e_x(x, \lambda)}{e(x, \lambda)} + \frac{s_x(x, \lambda)}{s(x, \lambda)}. \qquad (2.4)$$
When the illumination color and intensity vary gradually over the small region where the spatial gradient is computed, e_x/e will be small and thus can be ignored. If L_x is invariant to illumination pose, we may construct a feature space based on the x- and y-derivatives of L (such as ∇L or ∇²L) at the spectral locations {λ_k : 1 ≤ k ≤ L}. However, g_x(x)/g(x) is only conditionally invariant to illumination pose, and L_x is not invariant to general 3-D rotation and scale.
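As a concrete illustration (a sketch under assumptions, not the authors' implementation), the following computes finite-difference approximations of the spectral and spatial log-gradients from a stack of narrow-band images; the band centres and the image stack itself are placeholders.

import numpy as np

# Hedged sketch: finite-difference approximations of L_lambda and L_x from a
# stack of narrow-band images.  `bands` is a placeholder array of shape
# (L, H, W) sampled at wavelengths `lams` (in nm); both are assumed data.
def log_gradients(bands, lams, eps=1e-6):
    L = np.log(bands + eps)                       # L(x, lambda) = ln I(x, lambda)
    # spectral gradient: removes the geometric factor g(x), cf. (2.3)
    L_lambda = np.diff(L, axis=0) / np.diff(lams)[:, None, None]
    # spatial gradient in x, cf. (2.4); unit pixel spacing assumed
    L_x = np.diff(L, axis=2)
    return L_lambda, L_x

lams = np.array([450.0, 550.0, 650.0])            # three assumed band centres
bands = np.random.rand(3, 64, 64) + 0.1           # stand-in for filtered images
L_lambda, L_x = log_gradients(bands, lams)
# Stacking L_lambda over the band pairs gives, per pixel, the feature vector
# of chromaticity gradients described above.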
3. Our Segmentation Paradigm
The main objective of understanding the photometric properties of the scene is to separate the artifacts of the illumination from the actual optical properties of the materials in the scene. A successful decomposition allows:
- a more accurate reconstruction of the scene geometry; and
- the determination of a richer description of each surface/object patch, i.e., in addition to geometric and color descriptors, we will have material descriptors which are obtained from our optical observations.
3.1 Operating Assumptions

We make the following assumptions:
1. We know the classes of materials that the scene is made of.
2. For each material in the scene, we know: a) the BRDF as a function of wavelength; b) the index of refraction; and c) the polarization properties.
3. The primary illumination is broad-band and has a slowly-varying spectrum.
4. The surface patches are made of a single class of optically detectable material. This implies that the scene is decomposable into such patches/surfaces.
5. The observer is active and mobile.

Based on these assumptions, we define the following process:
1. Obtain initial estimates of surface normals using the polynocular stereo module. Identify the regions with strong highlights.
2. Apply the set of pairs of narrow-band spectral filters with center frequencies corresponding to the classes of materials from Assumption 1. By using filter pairs at adjacent frequencies, form a differential (difference) with respect to λ. Thus, for each class of optically different materials we shall obtain an image whose values correspond to the spectral differences, i.e., the finite-difference approximation to the first derivative.
3. For each class of materials we perform clustering/region growing (a sketch follows below), which will lead to a decomposition of the scene into coherent regions with respect to optically different classes of materials. This, in turn, will give us a richer surface description which can then be used for more accurate matching for the recovery of geometry.

Remark: This procedure based on spectral differentiation is yet to be experimentally tested. It is the analogue of differential stereo [Farid and Simoncelli 1997], where the sampling takes place in space rather than in λ. We believe that this interplay between these two spaces and operations on them will lead to improved photometric and geometric inferences.
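The following is a minimal sketch of process steps 2-3 above (spectral differencing followed by clustering); the filter bands, the number of material classes and the toy k-means clustering are all assumptions for illustration, not the procedure that will eventually be tested.

import numpy as np

# Hedged sketch of process steps 2-3: form spectral differences from adjacent
# narrow-band filter pairs, then cluster pixels into material classes.
def spectral_difference_features(bands):
    # bands: (L, H, W) narrow-band images; adjacent differences approximate
    # the derivative with respect to wavelength
    return np.diff(bands, axis=0).reshape(bands.shape[0] - 1, -1).T   # (H*W, L-1)

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centres[None]) ** 2).sum(-1), axis=1)
        centres = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centres[j] for j in range(k)])
    return labels

bands = np.random.rand(4, 32, 32)            # stand-in for filtered images
X = spectral_difference_features(bands)
labels = kmeans(X, k=3).reshape(32, 32)      # one material label per pixel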
4. Polynocular Stereo Reconstruction

Our polynocular stereo system, developed over the last two years by R. Sara, is composed of five cameras. There are four monochrome cameras which are
used to acquire geometry, and a fifth color camera which is used to acquire texture. The primary objective of this project is a tele-immersion application, a joint endeavor with Professor H. Fuchs from the University of North Carolina. The main concern here is high-accuracy recovery of geometry, as well as radiometric correctness of the recovered surface texture. We have a hierarchy of processes which leads to 3-D points obtained from intensity images via normalized cross-correlation, sensory fusion and removal of outliers. We have obtained some promising results, see [Sara and Bajcsy 1998], [Kamberova and Bajcsy 1998] and [Sara, Bajcsy, Kamberova, and McKendall 1998]. Here we avoided the photometric distortion by using redundant polynocular measurements combined with robust estimation and rejection of outliers. This works if the system has redundant measurements, i.e., if more than two cameras see the same surface patch. Our hypothesis is that by first Lambertizing the images one can improve the input data set and hence reduce the uncertainty in computing the geometry.
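A minimal sketch of window-based normalized cross-correlation matching, the similarity measure mentioned above, is given below; the window size, search range and synthetic images are illustrative assumptions, not parameters of the actual system.

import numpy as np

# Hedged sketch of window-based normalized cross-correlation (NCC) matching.
def ncc(a, b, eps=1e-9):
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def best_disparity(left, right, row, col, half=5, max_disp=20):
    win_l = left[row - half:row + half + 1, col - half:col + half + 1]
    scores = []
    for d in range(max_disp + 1):
        c = col - d
        if c - half < 0:
            break
        win_r = right[row - half:row + half + 1, c - half:c + half + 1]
        scores.append(ncc(win_l, win_r))
    return int(np.argmax(scores)), max(scores)

left = np.random.rand(100, 100)
right = np.roll(left, -7, axis=1)            # synthetic 7-pixel disparity
print(best_disparity(left, right, row=50, col=60))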
5. Experimental Description
The current BRDF measurement system is shown in Figure 7.1. It consists of a gonioreflectometer which was built by aligning three microstepper-controlled rotation stages and a goniometric cradle, each with 0.001 deg resolution. The output of a 1000 W quartz-tungsten halogen lamp is focused into a 2 in. diameter Spectralon integrating sphere and then collimated. The result is an unpolarized beam with a uniform intensity profile. Reflected light from the sample is collected with an optical fiber bundle and analyzed with a multichannel spectrograph. The spectrograph consists of a monochromator with a 600 lines/mm diffraction grating and a 1024-element linear photodiode array. For a given set of observation angles (azimuth and polar), the detector can instantly obtain the spectrum of the reflected light with a resolution of 2 nm. For a given incident light direction, the BRDF of a sample for about 600 detector angle combinations (azimuth 0-360 deg and polar 0-90 deg with respect to the sample surface normal) between 400-700 nm can be obtained within one and a half hours. The data is automatically logged by the computer that controls the rotation stages and displays the results.
6. 3-D Reconstruction Without Lambertization
We performed the following preliminary test. We used a flat sample of wood with some linear grain as the test object. The optical axis of the camera system was positioned perpendicular to the sample at a distance of approximately 80 mm. We varied the angle of the point source of white light illumination relative to the surface patch. Then we applied a binocular stereo
recovery algorithm. We present two experiments which differ only in the angle of illumination. The following discussion refers to Figures 7.2 and 7.3. Figure 7.2 depicts the left and right image pairs and corresponding intensity histograms for both illumination conditions. Figure 7.3 depicts the reconstruction results with corresponding histograms of estimated depth and residuals for the planar fit for both illumination conditions. It is evident that even under such controlled conditions, there is a significant variation in the photometric effects between the image pairs. Further, when the light is positioned at certain other angles, there is significant non-Lambertian behavior, i.e., the light is reflected strongly into one of the camera pairs. This non-Lambertian behavior over virtually the entire target area totally defeats the binocular stereo reconstruction procedure.
7. Cost/Benefit and Sensitivity Analyses
As this research progresses, we will develop cost/benefit and sensitivity analyses to help ascertain whether or not it pays to attempt to separate photometry from geometry in the manner delineated above. The following issues need to be addressed:
1. How many iterations or cycles through Phases 1 and 2 should be prescribed? A stopping rule needs to be determined. We note that, at best, we will only obtain self-consistent results between the photometric and geometric interpretations of the pixel data. There is no internal check for correctness. Consistency does not necessarily imply correctness.
2. There may be a serious sensitivity problem with respect to errors in the inferred surface normals and in the angular position of the sources of illumination. It is conceivable that the iterated use of pseudo-random patterned light may be more cost effective in obtaining stereo matches than the photometric correction techniques. It is already known that a single (non-iterative) application of such patterned light leads to improved matching in polynocular stereo.
Fig. 7.1. The GRASP Lab BRDF Measurement System
Fig. 7.2. Top: Image Pairs and Image Intensity Histograms for Angle of Illumination at 45 deg. Bottom: Image Pairs and Image Intensity Histograms for Angle of Illumination at 135 deg.
Fig. 7.3. Top: Reconstruction Results for Angle of Illumination at 45 deg. Bottom: Reconstruction Results for Angle of Illumination at 135 deg.
References
1. B. K. P. Horn: The Facts of Light, MIT AI working paper, May 1975.
2. B. K. P. Horn: Image Intensity Understanding, MIT AI report #335, August 1975.
3. S. A. Shafer: "Using Color to Separate Reflection Components," COLOR Research and Application, Vol. 10, #4, pp. 210-218, 1985.
4. S. W. Lee: Understanding of Surface Reflections in Computer Vision by Color and Multiple Views, PhD Dissertation, Computer and Information Science Department, University of Pennsylvania, February 1992.
5. H. L. F. Helmholtz: Treatise on Physiological Optics, translated by J. P. Southall, Dover, New York, 1910.
6. W. Richards: Psycho-metrical Numerology, Tech. Engr. News, XLVIII, pp. 11-17, 1967.
7. W. A. Richards: A Lightness Scale from Image Intensity Distributions, MIT AI Memo #648, August 1981.
8. E. H. Adelson and A. P. Pentland: The Perception of Shading and Reflectance (Vision and Modeling technical report 140), MIT Media Laboratory, 1990.
9. S. S. Bergstrom: Common and relative components of reflected light as information about the illumination, color, and three-dimensional form of objects, Scandinavian Journal of Psychology, 18(3), pp. 180-186.
10. A. Gilchrist, C. Kossyfidis, F. Bonato, T. Agostini, J. Cataliotti, X. Li, B. Spehar, and J. Szura: An Anchoring Theory of Lightness Perception, taken from the web site of Alan Gilchrist, Psychology Dept., Rutgers University, Newark, NJ 07102.
11. P. Kubelka and F. Munk: Ein Beitrag zur Optik der Farbanstriche, Z. tech. Physik, vol. 12, page 593, 1931.
12. K. E. Torrance and E. M. Sparrow: Theory of off-specular reflection from roughened surfaces, Journal of the Optical Society of America, Vol. 57, pp. 1105-1114, 1967.
13. X. D. He, K. E. Torrance, F. X. Sillion, and D. P. Greenberg: A Comprehensive Physical Model for Light Reflection, Computer Graphics, Vol. 25, No. 4, pp. 175-186, 1991.
14. J. Arvo, K. Torrance, and B. Smits: A Framework for the Analysis of Error in Global Illumination Algorithms, Computer Graphics Proceedings, Annual Conference Series, pp. 75-84, 1994.
15. H. Farid and E. Simoncelli: Range Estimation by Optical Differentiation, submitted to Journal of the Optical Society of America, September 1997.
16. R. Sara and R. Bajcsy: Fish Scales: Representing Fuzzy Manifolds, Proceedings of the Sixth International Conference on Computer Vision, pp. 811-817, Bombay, India, January 1998.
17. R. Sara, R. Bajcsy, G. Kamberova, and R. McKendall: 3-D Data Acquisition and Interpretation for Virtual Reality and Telepresence, IEEE/ATR Workshop on Computer Vision for Virtual Reality Based Human Communication, in conjunction with ICCV'98, invited talk, Bombay, India, January 1998.
18. G. Kamberova and R. Bajcsy: Precision of 3-D Points Reconstructed from Stereo, European Conference on Computer Vision, 1998 (submitted).
Vision-Based System Identification and State Estimation

William A. Wolovich and Mustafa Unel
Division of Engineering, Brown University, Providence, RI 02912 USA
Summary. There are many situations where a (primary) controlled system must "interact" with another (secondary) system over which it has no direct control; e.g. the robot arm of a space vehicle grasping some free-floating object, a plane being refueled in flight by a tanker aircraft, a military tank intent on identifying and destroying enemy tanks, a surgeon attempting to remove all portions of an odd-shaped tumor, and a highway vehicle trying to maintain a certain speed as well as a safe distance from other vehicles. Such scenarios can involve both stationary and moving objects. In virtually all cases, however, the more knowledge that the primary system has about the secondary system, the more probable the success of the interaction. Clearly, such knowledge is very often vision-based. This paper will focus on some recent results related to both identifying what a planar object (system) is and what its static or dynamic state is, based primarily on different views of its boundary. Boundary data information has been used extensively in a wide variety of situations in pattern analysis and image understanding. While the results that we will present here also are more generally applicable in computer vision, we will focus on how they can be applied to control system applications, and more specifically to the "visual" part of "visual-servoing."
1. Introduction

The automatic identification and alignment of free-form objects is an important problem in several disciplines, including computer vision, robotics, industrial inspection and photogrammetry. When such objects are in motion, the lack of specific features, such as points or lines, that can be identified easily at different times and in different locations can prevent accurate estimations of the rotational and translational velocities of the object. In such cases, we will show how sets of boundary data points can be used to construct "implicit polynomial" (IP) models of the object in any given position. Such models will then imply non-visual points that can be used for tracking purposes. In this paper, we present a unique "decomposition" for any boundary curve defined by an IP equation of arbitrary degree. This decomposition is expressed as a sum of conic-line products which map to similar expressions under Euclidean transformations, which characterize planar motion. The resulting conic factor centers and the line factor intersections are shown to be useful new "related-points," which can be used to explicitly determine the
transformation matrix which defines various positions of the object at different times, hence to approximate the planar (rotational and translational) velocities of the object.
2. Conic-Line Products of Algebraic Curves

To begin, we first note that an algebraic curve of degree n can be defined in the Cartesian {x, y}-plane by the implicit polynomial equation:

f_n(x, y) = \underbrace{a_{00}}_{h_0} + \underbrace{a_{10}x + a_{01}y}_{h_1(x,y)} + \underbrace{a_{20}x^2 + a_{11}xy + a_{02}y^2}_{h_2(x,y)} + \cdots + \underbrace{a_{n0}x^n + a_{n-1,1}x^{n-1}y + \cdots + a_{0n}y^n}_{h_n(x,y)} = \sum_{r=0}^{n} h_r(x, y) = 0,    (2.1)

where each binary form h_r(x, y) is a homogeneous polynomial of degree r in the variables x and y. A monic polynomial of degree r in x and y will be defined by the condition that the coefficient of x^r equals 1. Therefore, the f_n(x, y) defined by (2.1), as well as the leading form h_n(x, y), will be monic if a_{n0} = 1. It will often be convenient to express a curve defined by (2.1) by a left to right ordered set of its coefficients, giving priority to the highest degree forms first, and the higher degree x terms next. In light of (2.1), such an ordered set of coefficients would be defined by the row vector

[a_{n0}, a_{n-1,1}, \ldots, a_{0,n}, a_{n-1,0}, \ldots, a_{0,n-1}, a_{n-2,0}, \ldots, a_{01}, a_{00}]    (2.2)
The substitution of mx for y in any homogeneous form h_r(x, y) implies that

h_r(x, y = mx) = x^r\, a_{0r}(m - m_{r1})(m - m_{r2}) \cdots (m - m_{rr}),

for possibly complex (conjugate) roots m_{ri}, or that

h_r(x, y) = a_{0r}(y - m_{r1}x)(y - m_{r2}x) \cdots (y - m_{rr}x).

Therefore, any homogeneous monic form, such as h_n(x, y) in (2.1), can be uniquely factored as the product of n lines; i.e.

h_n(x, y) = \prod_{i=1}^{n}\left[x - (1/m_{ni})\,y\right] = \prod_{i=1}^{n}\left[x + l_{ni}\,y\right], \qquad \text{where } l_{ni} \stackrel{\mathrm{def}}{=} -1/m_{ni}    (2.3)
Since h_{n-1}(x, y) has n coefficients and degree n - 1, we can next determine n unique scalars k_{nj} such that

h_{n-1}(x, y) = k_{n1}(x + l_{n2}y)(x + l_{n3}y)\cdots(x + l_{nn}y) + k_{n2}(x + l_{n1}y)(x + l_{n3}y)\cdots(x + l_{nn}y) + \cdots + k_{nn}(x + l_{n1}y)(x + l_{n2}y)\cdots(x + l_{n,n-1}y) = \sum_{j=1}^{n} k_{nj} \prod_{i \neq j}\left[x + l_{ni}y\right]    (2.4)
It then follows that (2.4) can be expressed as a system of n linearly independent equations in matrix-vector form, where the unknown vector [k_{n1}\; k_{n2}\; \ldots\; k_{nn}] can be directly determined by a single matrix inversion. Equations (2.3) and (2.4) subsequently imply that the product

\prod_{i=1}^{n}\left[x + l_{ni}y + k_{ni}\right] = h_n(x, y) + h_{n-1}(x, y) + r_{n-2}(x, y),    (2.5)
for some "remainder" polynomial r_{n-2}(x, y) of degree n - 2. Since the line factor x + l_{ni}y + k_{ni} can be written as the (vector) inner product

[1 \;\; l_{ni} \;\; k_{ni}]\,X = [x \;\; y \;\; 1]\,L_{ni} = X^T L_{ni}, \qquad X \stackrel{\mathrm{def}}{=} [x \;\; y \;\; 1]^T, \quad L_{ni} \stackrel{\mathrm{def}}{=} [1 \;\; l_{ni} \;\; k_{ni}]^T,

(2.1) and (2.5) imply that any monic

f_n(x, y) = \prod_{i=1}^{n} L_{ni}^T X + f_{n-2}(x, y) \stackrel{\mathrm{def}}{=} \Pi_n(x, y) + f_{n-2}(x, y)    (2.6)

for the n - 2 degree polynomial

f_{n-2}(x, y) = \sum_{i=0}^{n-2} h_i(x, y) - r_{n-2}(x, y).
If l_{ni} and/or k_{ni} are complex numbers, with complex conjugates defined by l*_{ni} and k*_{ni}, respectively, then x + l*_{ni}y + k*_{ni} = X^T L*_{ni} also will appear as a line factor in (2.6). Any two such complex conjugate line factors will imply a corresponding real, degenerate* conic factor

C_{ni}(x, y) \stackrel{\mathrm{def}}{=} X^T L_{ni} L^{*T}_{ni} X = x^2 + (l_{ni} + l^*_{ni})xy + l_{ni}l^*_{ni}y^2 + (k_{ni} + k^*_{ni})x + (l_{ni}k^*_{ni} + l^*_{ni}k_{ni})y + k_{ni}k^*_{ni}    (2.7)

* Since C_{ni}(x, y) can be factored as the product of two lines.
Therefore, a total of 2p ≤ n complex (conjugate) values for l_{ni} or k_{ni} will imply that Π_n(x, y) in (2.6) can be expressed by the unique, real conic-line product

\Pi_n(x, y) = \prod_{k=1}^{p} C_{nk}(x, y) \prod_{j=1}^{n-2p} L_{nj}^T X    (2.8)
We next note that if γ_{n-2} is the coefficient of x^{n-2} in the f_{n-2}(x, y) defined by (2.6), then a monic Π_{n-2}(x, y) can be defined for f_{n-2}(x, y)/γ_{n-2}, as above, so that

f_{n-2}(x, y) = \gamma_{n-2}\left[\Pi_{n-2}(x, y) + f_{n-4}(x, y)\right].

Subsequently defining γ_{n-4} as the coefficient of x^{n-4} in f_{n-4}(x, y), etc., it follows that any monic f_n(x, y) has a unique conic-line decomposition, namely

f_n(x, y) = \Pi_n(x, y) + \gamma_{n-2}\left[\Pi_{n-2}(x, y) + \gamma_{n-4}\left[\Pi_{n-4}(x, y) + \cdots\right]\right]    (2.9)
We finally remark that in the case of closed and bounded quartic (4th degree) curves, our conic-line decomposition implies that f_4(x, y) will factor as the product of two conics plus a third conic, or that

f_4(x, y) = C_{41}(x, y)\,C_{42}(x, y) + C_{20}(x, y)    (2.10)
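To make the construction above concrete, the following is a minimal numpy sketch of the two computational ingredients of this section: factoring the leading form h_n into its n lines via (2.3), and solving the linear system implied by (2.4) for the scalars k_{nj}. The coefficient-dictionary input convention and the function names are our own illustrative choices, not part of the original formulation, and degenerate cases (zero roots, non-monic input) are not handled.

```python
# Minimal sketch (not the authors' code): line factors of h_n and the k_nj of (2.4).
# f_n is assumed to be given as a dict {(i, j): a_ij} of coefficients of x^i y^j, a_n0 = 1.
import numpy as np

def leading_form_lines(coeffs, n):
    """Factor the leading form h_n(x, y) into lines x + l_i y, cf. (2.3)."""
    # h_n(x, m x) = x^n (a_n0 + a_{n-1,1} m + ... + a_0n m^n); its roots are the m_i.
    p = [coeffs.get((n - k, k), 0.0) for k in range(n, -1, -1)]  # m^n coefficient first
    m = np.roots(p)                      # possibly complex roots (assumed nonzero)
    return -1.0 / m                      # l_i = -1 / m_i

def line_product_coeffs(l):
    """Coefficients of prod_i (x + l_i y) in the basis x^d, x^{d-1} y, ..., y^d."""
    return np.poly(-np.asarray(l))

def solve_knj(coeffs, n, l):
    """Solve the n x n linear system implied by (2.4) for k_n1 ... k_nn."""
    b = np.array([coeffs.get((n - 1 - k, k), 0.0) for k in range(n)], dtype=complex)
    A = np.column_stack([line_product_coeffs(np.delete(l, j)) for j in range(n)])
    return np.linalg.solve(A, b)
```

With l = leading_form_lines(coeffs, n) and k = solve_knj(coeffs, n, l), the line factors x + l_i y + k_i of (2.5) are available, and the same steps can be repeated on the remainder to build the recursion (2.9).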
3. Line Factor Intersections and Conic Factor Centers

The intersection point d_p = {x_p, y_p} of any two real, non-parallel line factors defined in (2.9), such as L_{ij}^T X = x + l_{ij}y + k_{ij} and L_{qr}^T X = x + l_{qr}y + k_{qr}, can be defined by the matrix/vector relation

\begin{bmatrix} 1 & l_{ij} & k_{ij} \\ 1 & l_{qr} & k_{qr} \end{bmatrix} \begin{bmatrix} x_p \\ y_p \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}    (3.1)

Moreover, the center d_c = {x_c, y_c} of any conic factor C_{mi}(x, y) in (2.9), as defined by (2.7) when m = n, can be defined by the matrix/vector relation [4]

\begin{bmatrix} 2 & l_{mi} + l^*_{mi} \\ l_{mi} + l^*_{mi} & 2\, l_{mi} l^*_{mi} \end{bmatrix} \begin{bmatrix} x_c \\ y_c \end{bmatrix} = -\begin{bmatrix} k_{mi} + k^*_{mi} \\ l_{mi} k^*_{mi} + l^*_{mi} k_{mi} \end{bmatrix}    (3.2)
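As a small illustration, the two related-point computations above reduce to 2x2 linear solves. This is only a sketch under the notation of this section; the conjugate pair (l, k) is assumed to be stored as Python complex numbers.

```python
# Sketch of the related-point computations of Section 3 (not the authors' code).
import numpy as np

def line_intersection(l1, k1, l2, k2):
    """Intersection of x + l1*y + k1 = 0 and x + l2*y + k2 = 0, cf. (3.1)."""
    A = np.array([[1.0, l1], [1.0, l2]])
    return np.linalg.solve(A, -np.array([k1, k2]))           # (x_p, y_p)

def conic_center(l, k):
    """Center of the degenerate conic (2.7) built from the conjugate pair (l, k)."""
    A = np.array([[2.0,          2.0 * l.real],
                  [2.0 * l.real, 2.0 * abs(l) ** 2]])
    b = -np.array([2.0 * k.real, 2.0 * (l * k.conjugate()).real])
    return np.linalg.solve(A, b)                             # (x_c, y_c)
```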
4. Euclidean Mappings of Related Points

To estimate planar object motion, one can employ the Euclidean transformation matrix E which relates the boundary data points of a planar object at two different positions. Such an E is defined by both a rotation M and a linear translation P; i.e.

\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x̄ \\ ȳ \end{bmatrix} + \begin{bmatrix} p_x \\ p_y \end{bmatrix}, \qquad \text{i.e.} \qquad \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta & p_x \\ \sin\theta & \cos\theta & p_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x̄ \\ ȳ \\ 1 \end{bmatrix} \stackrel{\mathrm{def}}{=} E\,X̄    (4.1)

The mathematical relationship defined by (4.1) will be abbreviated as X = E X̄, where M is an orthogonal (rotation) matrix, so that M^T M = M M^T = I. In general, any two n-th degree curves which outline the boundary of the same object in two different positions, defined by a monic f_n(x, y) = 0 and a monic f̄_n(x̄, ȳ) = 0, will be Euclidean equivalent, in the sense that

f_n(x, y) = 0 \iff f_n(\cos\theta\, x̄ - \sin\theta\, ȳ + p_x,\; \sin\theta\, x̄ + \cos\theta\, ȳ + p_y) \stackrel{\mathrm{def}}{=} s_n\, f̄_n(x̄, ȳ) = 0    (4.2)

for some scalar s_n. Two corresponding related-points² of any two Euclidean equivalent curves defined by f_n(x, y) = 0 and f̄_n(x̄, ȳ) = 0, such as d_i = {x_i, y_i} and d̄_i = {x̄_i, ȳ_i}, respectively, will be defined by the condition that

\begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta & p_x \\ \sin\theta & \cos\theta & p_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x̄_i \\ ȳ_i \\ 1 \end{bmatrix}    (4.3)

Therefore any three corresponding related-points will define the Euclidean transformation matrix E via the relation:

\begin{bmatrix} x_1 & x_2 & x_3 \\ y_1 & y_2 & y_3 \\ 1 & 1 & 1 \end{bmatrix} \stackrel{\mathrm{def}}{=} T = E \begin{bmatrix} x̄_1 & x̄_2 & x̄_3 \\ ȳ_1 & ȳ_2 & ȳ_3 \\ 1 & 1 & 1 \end{bmatrix} \stackrel{\mathrm{def}}{=} E\,T̄, \qquad \text{so that} \qquad E = T\,T̄^{-1}    (4.4)

² Which are analogous to the so-called interest points defined in [3].
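A direct transcription of (4.4) into code is essentially a one-liner once the three corresponding related-points are stacked into homogeneous 3x3 matrices; the helper below is a hypothetical illustration, not code from the paper.

```python
# Sketch of eq. (4.4): recover E from three corresponding related-points.
import numpy as np

def euclidean_from_related_points(d, d_bar):
    """d, d_bar: three (x, y) pairs in the two positions; returns the 3x3 matrix E."""
    T     = np.vstack([np.asarray(d, float).T,     np.ones(3)])
    T_bar = np.vstack([np.asarray(d_bar, float).T, np.ones(3)])
    return T @ np.linalg.inv(T_bar)      # E = T * inv(T_bar)
```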
5. Conic-Line Euclidean Transformations

Under a Euclidean transformation E, (4.1) and (2.6) will imply that every

\Pi_q(x, y) = \prod_{i=1}^{q} L_{qi}^T X = \prod_{i=1}^{q} L_{qi}^T E X̄ \stackrel{\mathrm{def}}{=} \prod_{i=1}^{q} s_{qi}\, L̄_{qi}^T X̄ \stackrel{\mathrm{def}}{=} s_q\, Π̄_q(x̄, ȳ)    (5.1)

for a real scalar s_q = \prod_{i=1}^{q} s_{qi} and q monic line factors L̄_{qi}^T X̄, for q = n, n-2, n-4, ..., which will imply real conic factors when they appear in complex conjugate pairs. Therefore, in light of (2.9), the mapping

f_n(x, y) \mapsto Π̄_n(x̄, ȳ) + \gamma_{n-2}\left[Π̄_{n-2}(x̄, ȳ) + \gamma_{n-4}\left[Π̄_{n-4}(x̄, ȳ) + \cdots\right]\right]

will define a unique monic polynomial that is Euclidean equivalent to f_n(x, y), namely

f̄_n(x̄, ȳ) = Π̄_n(x̄, ȳ) + \gamma_{n-2}\frac{s_{n-2}}{s_n}\left[Π̄_{n-2}(x̄, ȳ) + \gamma_{n-4}\frac{s_{n-4}}{s_{n-2}}\left[Π̄_{n-4}(x̄, ȳ) + \cdots\right]\right]    (5.2)
Each Π_q(x, y) of f_n(x, y), and each corresponding Π̄_q(x̄, ȳ) of a Euclidean equivalent f̄_n(x̄, ȳ), will have the same number of real conic factors and real line factors, as defined by (2.8). Moreover, (5.1) implies that all of these factors will map to one another under Euclidean transformations. Therefore, f_n(x, y) and f̄_n(x̄, ȳ) will have the same number of corresponding related-points, as defined by the centers of their corresponding conic factors and all possible intersections of their corresponding line factors. Moreover, as shown in [6], all of these corresponding related-points will map to one another under
any Euclidean transformation. In the special case of closed and bounded quartic curves, where (2.10) holds, the Euclidean equivalence of two curves defined by complete sets of data points will imply that

f_4(x, y) \cong f̄_4(x̄, ȳ) = C̄_{41}(x̄, ȳ)\,C̄_{42}(x̄, ȳ) + C̄_{20}(x̄, ȳ)    (5.3)

Therefore, the four conic factor centers of C_{4i}(x, y) and C̄_{4i}(x̄, ȳ), for i = 1 and 2, as well as the two conic factor centers of C_{20}(x, y) and C̄_{20}(x̄, ȳ), all of which map to one another under E, can be used to determine E via (4.4), as we will later illustrate.
6. Absolute Euclidean Invariants for System Identification

In light of (4.2), any number k of corresponding related-points of the Euclidean equivalent IPs f_n(x, y) and f̄_n(x̄, ȳ) will satisfy the relation

z_i \stackrel{\mathrm{def}}{=} f_n(d_i) = s_n f̄_n(d̄_i) \stackrel{\mathrm{def}}{=} s_n z̄_i \quad \text{for } i = 1, 2, \ldots, k    (6.1)

As a consequence,

s_n = \frac{\sum_{i=1}^{k} z_i}{\sum_{i=1}^{k} z̄_i}    (6.2)

To establish the correct correspondence for the k related-points in any two corresponding conic-line products, we will order the d_i such that z_1 < z_2 < ... < z_p, and

z̄_1 < z̄_2 < \cdots < z̄_p \text{ if } s_n > 0, \qquad \text{and} \qquad z̄_1 > z̄_2 > \cdots > z̄_p \text{ if } s_n < 0    (6.3)

If we use absolute values in (6.1) and (6.2), to insure uniqueness, it follows that any related-point ratio defined by

I_k = \frac{\sum_{i=1}^{k} |z_i|}{\left(\prod_{i=1}^{k} |z_i|\right)^{1/k}} = \frac{\sum_{i=1}^{k} |z̄_i|}{\left(\prod_{i=1}^{k} |z̄_i|\right)^{1/k}},    (6.4)

which is independent of the ordering (correspondence) of the related-points, will be an absolute Euclidean invariant [8] of the Euclidean equivalent curves. Different invariants can be defined via (6.4) for different combinations of k corresponding related-points, and these invariants can be used for system identification; i.e. to identify the particular object being tracked.
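A minimal sketch of the invariant (6.4), as reconstructed above, is given below; the z_i are the IP values at the related-points, and the function name is our own.

```python
# Sketch of the absolute Euclidean invariant I_k of eq. (6.4).
import numpy as np

def absolute_invariant(z):
    """Sum of |z_i| divided by the k-th root of their product."""
    z = np.abs(np.asarray(z, dtype=float))
    return z.sum() / np.prod(z) ** (1.0 / len(z))
```

For the airplane example of Section 7, the values z = (-1.094, -0.496, 31.687) and z̄ = (-0.0133, -0.0058, 0.386) give approximately 12.9 and 13.1, which matches the nearly identical invariants reported there.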
7. Boundary Data Set 3L Fitting

Now suppose we have two sets of data points that describe the entire boundary of the same object in two different positions, which imply two Euclidean equivalent curves. To use our conic-line decomposition in such cases, one must first fit an IP curve to the data sets. The 3L fitting algorithm [1, 2] will be used for this purpose here. Basically, 3L fitting is explicit linear least squares fitting that is implemented by augmenting each data point (of a data set) by a pair of synthetically generated points at an equal distance to either side of the data point in a direction perpendicular to the data curve. An explicit polynomial is then fit to the entire data set, where the values assigned to the synthetic data points are +c or -c, for an arbitrary scalar c, depending on whether the points
are inside or outside the data set, respectively. The original data points are assigned a value of 0. Figure 7.1 depicts a 10th degree IP curve, obtained using 3L fitting, which outlines the boundary of an IR tank image.
Fig. 7.1. A 10th Degree IP Fit of an IR Tank Image

3L fitting is Euclidean invariant, numerically stable, and fast, when compared to more traditional least-square fitting algorithms. This section will illustrate how our conic-line decomposition and 3L fitting can be used to determine the Euclidean transformation matrix which relates two different views of the same planar object that is initially defined by its boundary data sets.
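The following is a rough sketch of the 3L idea as described above; it is not the algorithm of [1, 2] itself. Each boundary point is flanked by two synthetic points offset along an estimated normal, the synthetic points are assigned the values +c and -c, the originals 0, and a polynomial is then fit by linear least squares. The offset distance d, the crude normal estimate, and which side counts as "inside" are illustrative assumptions.

```python
# Rough 3L-style fit (illustrative only), assuming an ordered, closed contour.
import numpy as np

def monomials(pts, degree):
    """Rows of monomials x^i y^j with i + j <= degree, one row per point."""
    x, y = pts[:, 0], pts[:, 1]
    return np.column_stack([x**i * y**j
                            for i in range(degree + 1)
                            for j in range(degree + 1 - i)])

def fit_3L(pts, degree, d=0.05, c=1.0):
    # crude normals from central differences along the closed point sequence
    t = np.roll(pts, -1, axis=0) - np.roll(pts, 1, axis=0)
    n = np.column_stack([-t[:, 1], t[:, 0]])
    n /= np.linalg.norm(n, axis=1, keepdims=True)
    P = np.vstack([pts, pts + d * n, pts - d * n])            # original + two offset layers
    v = np.concatenate([np.zeros(len(pts)),                   # boundary points -> 0
                        c * np.ones(len(pts)),                # "inside" layer -> +c
                        -c * np.ones(len(pts))])              # "outside" layer -> -c
    coef, *_ = np.linalg.lstsq(monomials(P, degree), v, rcond=None)
    return coef
```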
Example. Figure 7.2 depicts two data sets (the 500 point "solid" curves) which represent the outline of an airplane in two different positions, the lower one at time t_0 and the upper one at some time t_f > t_0. To estimate the velocity of the plane between these two positions, we first apply the 3L fitting algorithm of [1] to both data sets to obtain 4th degree implicit polynomials³ whose zero sets are depicted by the bolder curves in Figure 7.2. The upper quartic f_4(x, y) = 0 is defined by the (monic) row vector

[1, 1.622, 1.038, -59.896, 111.137, -0.0061, -5.737, -19.026, 42.731, -2.301, 3.715, 14.17, 0.802, 3.79, -0.204],

and the lower (Euclidean equivalent) quartic f̄_4(x̄, ȳ) = 0 is defined by the row vector

[1, -1.759, 1.049, -0.211, 0.0213, -9.928, 12.69, -4.592, 0.476, 36.632, -29.641, 5.012, -59.214, 22.589, 35.321].

³ In many practical situations, quartic IP curves are general enough for accurate E determination, even though they do not accurately fit the data set everywhere, such as at the airplane tail sections.
Fig. 7.2. Two Superimposed Airplane Data Sets
Our conic-line decomposition of f_4(x, y) = 0 then implies that

C_{41}(x, y) = x^2 + 5.915xy + 21.183y^2 + 0.956x + 2.709y + 0.229,

and that

C_{42}(x, y) = x^2 - 4.293xy + 5.247y^2 - 0.962x + 1.346y + 0.433,

with centers at d_1 = {-0.492, 0.0047} and d_2 = {1.685, 0.561}, respectively, so ordered using (6.3) because s_4 > 0⁴ and z_1 = f_4(d_1) = -1.094 < z_2 = f_4(d_2) = -0.496.

⁴ As we will later show via (7.1).
The center of

C_{20}(x, y) = -2.043x^2 + 3.455xy + 0.155y^2 + 0.608x + 2.31y - 0.303

is at d_3 = {x_3, y_3} = {-0.59, -0.874}, where z_3 = f_4(d_3) = 31.687. An analogous conic-line decomposition of f̄_4(x̄, ȳ) = 0 implies that

C̄_{41}(x̄, ȳ) = x̄^2 - 1.505x̄ȳ + 0.634ȳ^2 - 6.148x̄ + 4.955ȳ + 9.847,

and that

C̄_{42}(x̄, ȳ) = x̄^2 - 0.253x̄ȳ + 0.0336ȳ^2 - 3.779x̄ + 0.488ȳ + 3.572,

with centers at d̄_1 = {1.25, -2.424} and d̄_2 = {1.857, -0.26}, respectively, so ordered because z̄_1 = f̄_4(d̄_1) = -0.0133 < z̄_2 = f̄_4(d̄_2) = -0.0058. The center of

C̄_{20}(x̄, ȳ) = -0.023x̄^2 - 0.0443x̄ȳ - 0.0401x̄ + 0.087ȳ + 0.149

is at d̄_3 = {x̄_3, ȳ_3} = {1.962, -2.948}, where z̄_3 = f̄_4(d̄_3) = 0.386, so that in light of (6.2),

s_4 = \frac{z_1 + z_2 + z_3}{z̄_1 + z̄_2 + z̄_3} = \frac{30.097}{0.367} = 82.01 > 0    (7.1)

We might note at this point that the IPs which define the two outlines imply nearly identical (k = 3) absolute Euclidean invariants, as defined by (6.4), namely
I_3 = \frac{\sum_{i=1}^{3} |z_i|}{\left(\prod_{i=1}^{3} |z_i|\right)^{1/3}} = 12.893 \approx 13.07 = \frac{\sum_{i=1}^{3} |z̄_i|}{\left(\prod_{i=1}^{3} |z̄_i|\right)^{1/3}}.

Using our conic factor centers, (4.4) next implies that

E = T\,T̄^{-1} = \begin{bmatrix} -0.492 & 1.685 & -0.59 \\ 0.0047 & 0.561 & -0.874 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1.25 & 1.857 & 1.962 \\ -2.424 & -0.26 & -2.948 \\ 1 & 1 & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 0.5 & 0.866 & 0.982 \\ -0.866 & 0.5 & 2.299 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta & p_x \\ \sin\theta & \cos\theta & p_y \\ 0 & 0 & 1 \end{bmatrix},

so that θ = -60°. The angular velocity of the airplane is therefore given by -60°/(t_f - t_0). To determine the translational velocities in the x and y directions, we note that (4.3) relates the centers of mass d_c = {x_c, y_c} and d̄_c = {x̄_c, ȳ_c} of the
two outlines. Since the center of mass of the upper outline is d_c = {x_c, y_c} ≈ {0, 0}, the inverse of (4.3) implies that d̄_c = {x̄_c, ȳ_c} ≈ {1.5, -2.0}.
Therefore, the (center of mass) translational velocities from the lower outline to the upper outline are approximately given by v_x = -1.5/(t_f - t_0) and v_y = 2.0/(t_f - t_0), in appropriate dimensional units. Of course, one could have several views (boundary data sets) of the airplane between the initial and final times, t_0 and t_f. In such cases, knowledge of the center of mass of any one set of boundary data points will imply the center of mass of all other boundary data sets via (4.3), once the IPs and the Euclidean transformations are determined. These additional views would clearly imply more accurate velocity approximations at the interim times between t_0 and t_f.
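As a check on the example, the numbers above can be reproduced in a few lines; the code simply re-evaluates (4.4) with the six conic factor centers quoted in the text and then reads off the rotation and the centroid displacement. It is a verification sketch, not part of the original experiment.

```python
# Verification sketch for the airplane example (centers taken from the text).
import numpy as np

d     = [(-0.492, 0.0047), (1.685, 0.561), (-0.59, -0.874)]   # upper-outline centers
d_bar = [(1.25, -2.424),   (1.857, -0.26), (1.962, -2.948)]   # lower-outline centers

T     = np.vstack([np.array(d).T,     np.ones(3)])
T_bar = np.vstack([np.array(d_bar).T, np.ones(3)])
E = T @ np.linalg.inv(T_bar)        # ~ [[0.5, 0.866, 0.98], [-0.866, 0.5, 2.3], [0, 0, 1]]

theta = np.degrees(np.arctan2(E[1, 0], E[0, 0]))               # ~ -60 degrees
# lower centroid, assuming the upper centroid sits at the origin
xc_bar, yc_bar, _ = np.linalg.inv(E) @ np.array([0.0, 0.0, 1.0])  # ~ (1.5, -2.0)
```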
8. Concluding Remarks

We have now outlined a vision-based procedure for identifying objects from a set of boundary data points, and then estimating the (state) velocities of the object from a sequence of images of the object at different times in different positions. In particular, we defined and illustrated a new, unique conic-line decomposition of any 2D curve defined by an IP equation of arbitrary degree n. In the special case of closed and bounded quartic curves, our decomposition implies the product of two conics plus a third conic. The conic factor centers and the line factor intersections of any IP decomposition represent useful related-points which map to one another under Euclidean transformations. Such transformations relate different positional views of the same object. Our related-points also directly define absolute Euclidean invariants, which can be used for object identification. Although the example that we presented used only two positions, it follows that any additional positional views (boundary data sets) between times t_0 and t_f would imply more precise velocity approximations.
References

1. Blane, M. M., Z. Lei and D. B. Cooper, "The 3L Algorithm for Fitting Implicit Polynomial Curves and Surfaces to Data," IEEE Transactions on Pattern Analysis and Machine Intelligence (under review), 1996.
2. Lei, Z., M. M. Blane, and D. B. Cooper, "3L Fitting of Higher Degree Implicit Polynomials," Proceedings of Third IEEE Workshop on Applications of Computer Vision, Sarasota, FL, December 1996.
3. Mundy, Joseph L. and Andrew Zisserman, Geometric Invariance in Computer Vision, The MIT Press, 1992.
4. Selby, Samuel M., CRC Standard Mathematical Tables, The Chemical Rubber Company, Seventeenth Edition, 1969.
5. Wolovich, William A., Robotics: Basic Analysis and Design, Holt, Rinehart and Winston, 1987.
6. Wolovich, William A. and Mustafa Unel, "The Determination of Implicit Polynomial Canonical Curves," to appear in the IEEE PAMI, 1998.
7. Unel, Mustafa and William A. Wolovich, "A Unique Decomposition of Algebraic Curves," Technical Note LEMS-166. Also submitted for possible publication in the International Journal of Computer Vision, September 1997.
8. Wolovich, William A. and Mustafa Unel, "Absolute Invariants and Affine Transformations for Algebraic Curves," submitted for possible publication to Computer Vision and Image Understanding, October 1997.
Visual Tracking, Active Vision, and Gradient Flows

Allen Tannenbaum and Anthony Yezzi, Jr.
Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA

Summary. In this note, we discuss the minimization of certain functionals and the resulting gradient flows for problems in active vision. In particular, we consider how these techniques may be applied to deformable contours, and L¹-based methods for optical flow. Such techniques are becoming essential tools in the emerging field of controlled active vision.
1. Introduction

In this note, we will discuss some of the work that we have been conducting on the development of a new approach for employing image-based feedback in control systems [13, 37, 17, 18, 29, 38]. We will concentrate on the vision and image processing aspects of our work, in particular the use of certain gradient geometric curvature driven flows. In addition to the control applications, these techniques have been applied in medical imaging, as well as shape and object recognition problems. A central research area at the interface of control and computer vision is that of visual tracking which may be used for a number of problems in manufacturing, robotics, and automatic target recognition. Even though tracking in the presence of a disturbance is a classical control issue, because of the highly uncertain nature of the disturbance, this type of problem is very difficult and challenging. Visual tracking differs from standard tracking problems in that the feedback signal is measured using imaging sensors. In particular, it has to be extracted via computer vision and image processing algorithms and interpreted by a reasoning algorithm before being used in the control loop. Furthermore, the response speed is a crucial aspect. We have been developing robust control algorithms for some years now, valid for general classes of distributed parameter and nonlinear systems based on interpolation and operator theoretic methods; see [8] and the references therein. In this paper, we will indicate how such control techniques may be combined with the gradient flows in image processing in order to develop novel visual tracking algorithms. Because of our interest in controlled active vision, we have been conducting research into advanced algorithms in image processing and computer vision for a variety of uses: image smoothing and enhancement, image segmentation, morphology, denoising algorithms, shape recognition, edge detection, optical flow, shape-from-shading, and deformable contours ("snakes");
see [13, 37, 17, 18, 29] and the references therein. Our ideas are motivated by certain types of energy functionals which lead to geometric invariant flows. These in turn are based on the mathematical theory of curve and surface evolution. One has powerful numerical algorithms based on Hamilton-Jacobi type equations and the associated theory of viscosity solutions for the computer implementation of this methodology [24, 32, 33].

An important method in active vision and tracking is that of deformable contours or snakes. The work we discuss here is based on [13, 37]. Snakes are autonomous processes which employ image coherence in order to track features of interest over time. In the past few years, a number of approaches have been proposed for the problem of snakes. The underlying principle in these works is based upon the utilization of deformable contours which conform to various object shapes and motions. Snakes have been used for edge and curve detection, segmentation, shape modeling, and especially for visual tracking. We have developed a new deformable contour model which is derived from a generalization of Euclidean curve shortening evolution. Our snake model is based on the technique of multiplying the Euclidean arc-length by a function tailored to the features of interest to which we want to flow, and then writing down the resulting gradient evolution equations. Mathematically, this amounts to defining a new Riemannian metric in the plane intrinsically determined by the geometry of the given image, and then computing the corresponding gradient flow. This leads to some interesting new snake models which efficiently attract the given active contour to the desired feature (which is regarded as lying at the bottom of a potential well). The method also allows us to naturally write down 3-D active surface models as well. Our model can handle multiple contours as well as topological changes such as merging and breaking which classical snakes cannot. This will be one of the key ideas which we employ in visual tracking. See Section 2.

A number of the algorithms we have developed are based on ideas in optimal control and the corresponding gradient flows, especially our work on the estimation of optical flow and stereo disparity [17, 18]. Indeed, let us consider in some detail optical flow. The computation of optical flow has proved to be an important tool for problems arising in active vision, including visual tracking. The optical flow field is defined as the velocity vector field of apparent motion of brightness patterns in a sequence of images. It is assumed that the motion of the brightness patterns is the result of relative motion, large enough to register a change in the spatial distribution of intensities on the images. We are now exploring various constrained optimization approaches for the purpose of accurately computing optical flow. In this paper, we apply an L¹ type minimization technique to this problem following [17]. See Section 3. These ideas make strong contact with viscosity theory for Hamilton-Jacobi equations from optimal control. Similar ideas are used in [18] for an L¹ optimization approach to stereo disparity.
2. Geometric Snakes and Gradient Flows
In this section, we will describe a new approach for snakes or active contours based on principles from geometric optimization theory. We follow [13, 37]. (See [7, 34] for related approaches.) Active contours may be regarded as autonomous processes which employ image coherence in order to track various features of interest over time. Such deformable contours have the ability to conform to various object shapes and motions. Snakes have been utilized for segmentation, edge detection, shape modeling, and visual tracking. Active contours have also been widely applied for various applications in medical imaging. For example, snakes have been employed for the segmentation of myocardial heart boundaries as a prerequisite from which such vital information as ejection-fraction ratio, heart output, and ventricular volume ratio can be computed. The recent book by Blake and Yuille [5] contains an extensive collection of papers on the theory and practice of deformable contours together with a large list of references.

In this paper, we consider an approach based on length minimization. We should note that in some more recent work we have considered an area based minimization method with some encouraging results; see [35] for all the details. In the classical theory of snakes, one considers energy minimization methods where controlled continuity splines are allowed to move under the influence of external image dependent forces, internal forces, and certain constraints set by the user. As is well-known, there may be a number of problems associated with this approach such as initializations, existence of multiple minima, and the selection of the elasticity parameters. Moreover, natural criteria for the splitting and merging of contours (or for the treatment of multiple contours) are not readily available in this framework. In [13], we have described a new deformable contour model to successfully solve such problems, and which will become one of our key techniques for tracking. Our method is based on the Euclidean curve shortening evolution, which defines the gradient direction in which a given curve is shrinking as fast as possible relative to Euclidean arc-length, and on the theory of conformal metrics. Namely, we multiply the Euclidean arc-length by a function tailored to the features of interest which we want to extract, and then we compute the corresponding gradient evolution equations. The features which we want to capture therefore lie at the bottom of a potential well to which the initial contour will flow. Further, our model may be easily extended to extract 3D contours based on motion by mean curvature [13, 37].

Let us briefly review some of the details from [13]. (A similar approach was formulated in [7].) First of all, in [6] and [19], a snake model based on the level set formulation of the Euclidean curve shortening equation is proposed. The model which they propose is

\frac{\partial\Psi}{\partial t} = \phi(x, y)\,\|\nabla\Psi\|\left(\mathrm{div}\!\left(\frac{\nabla\Psi}{\|\nabla\Psi\|}\right) + \nu\right).    (2.1)
Here the function φ(x, y) depends on the given image and is used as a "stopping term." For example, the term φ(x, y) may be chosen to be small near an edge, and so acts to stop the evolution when the contour gets close to an edge. One may take [6, 19]

\phi := \frac{1}{1 + \|\nabla G_\sigma * I\|^2},    (2.2)

where I is the (grey-scale) image and G_σ is a Gaussian (smoothing) filter. The function Ψ(x, y, t) evolves in (2.1) according to the associated level set flow for planar curve evolution in the normal direction with speed a function of curvature, which was introduced in [24, 32, 33]. It is important to note that the Euclidean curve shortening part of this evolution, namely

\frac{\partial\Psi}{\partial t} = \|\nabla\Psi\|\,\mathrm{div}\!\left(\frac{\nabla\Psi}{\|\nabla\Psi\|}\right),    (2.3)

is derived as a gradient flow for shrinking the perimeter as quickly as possible. As is explained in [6], the constant inflation term ν is added in (2.1) in order to keep the evolution moving in the proper direction. Note that we are taking Ψ to be negative in the interior and positive in the exterior of the zero level set.

We would like to modify the model (2.1) in a manner suggested by Euclidean curve shortening [9]. Namely, we will change the ordinary Euclidean arc-length function along a curve C = (x(p), y(p))^T with parameter p, given by ds = (x_p^2 + y_p^2)^{1/2} dp, to

ds_\phi = (x_p^2 + y_p^2)^{1/2}\,\phi\,dp,

where φ(x, y) is a positive differentiable function. Then we want to compute the corresponding gradient flow for shortening length relative to the new metric ds_φ. Accordingly, set

L_\phi(t) := \int_0^1 \left\|\frac{\partial C}{\partial p}\right\|\,\phi\,dp.

Let

\mathcal{T} := \frac{\partial C}{\partial p}\Big/\left\|\frac{\partial C}{\partial p}\right\|
denote the unit tangent. Then taking the first variation of the modified length function L_φ, and using integration by parts (see [13]), we get that

L_\phi'(t) = -\int_0^{L(t)} \left\langle \frac{\partial C}{\partial t},\; \phi\kappa\mathcal{N} - (\nabla\phi\cdot\mathcal{N})\mathcal{N} \right\rangle ds,

where κ is the curvature and N the inward unit normal, which means that the direction in which the L_φ perimeter is shrinking as fast as possible is given by

\frac{\partial C}{\partial t} = \left(\phi\kappa - (\nabla\phi\cdot\mathcal{N})\right)\mathcal{N}.    (2.4)
This is precisely the gradient flow corresponding to the minimization of the length functional L_φ. The level set version of this is

\frac{\partial\Psi}{\partial t} = \phi\,\|\nabla\Psi\|\,\mathrm{div}\!\left(\frac{\nabla\Psi}{\|\nabla\Psi\|}\right) + \nabla\phi\cdot\nabla\Psi.    (2.5)
One expects that this evolution should attract the contour very quickly to the feature which lies at the bottom of the potential well described by the gradient flow (2.5). As in [6, 19], we may also add a constant inflation term, and so derive a modified model of (2.1) given by

\frac{\partial\Psi}{\partial t} = \phi\left(\|\nabla\Psi\|\,\mathrm{div}\!\left(\frac{\nabla\Psi}{\|\nabla\Psi\|}\right) + \nu\right) + \nabla\phi\cdot\nabla\Psi.    (2.6)
Notice that for φ as in (2.2), ∇φ will look like a doublet near an edge. Of course, one may choose other candidates for φ in order to pick out other features. We now have very fast implementations of these snake algorithms based on level set methods [24, 32]. Clearly, the ability of our snakes to change topology, and quickly capture the desired features, will make them an indispensable tool for our visual tracking algorithms. Finally, we have also developed 3D active contour evolvers for image segmentation, shape modeling, and edge detection based on both snakes (inward deformations) and bubbles (outward deformations) in our work [13, 37].
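For concreteness, one explicit time step of the level-set model (2.6) can be written as follows. This is a bare-bones finite-difference sketch with unit grid spacing, no reinitialization and no narrow-banding, so it should be read as an illustration of the equation above rather than as the fast implementation referred to in the text; the parameter values are arbitrary.

```python
# Illustrative explicit update of the flow (2.6); not the authors' implementation.
import numpy as np
from scipy.ndimage import gaussian_filter

def stopping_term(image, sigma=1.5):
    """phi = 1 / (1 + ||grad(G_sigma * I)||^2), cf. (2.2)."""
    gx = gaussian_filter(image, sigma, order=(0, 1))
    gy = gaussian_filter(image, sigma, order=(1, 0))
    return 1.0 / (1.0 + gx**2 + gy**2)

def snake_step(psi, phi, nu=0.2, dt=0.1, eps=1e-8):
    """psi <- psi + dt * [ phi (|grad psi| div(grad psi/|grad psi|) + nu) + grad phi . grad psi ]."""
    py, px = np.gradient(psi)
    norm = np.sqrt(px**2 + py**2) + eps
    div = np.gradient(px / norm, axis=1) + np.gradient(py / norm, axis=0)
    phiy, phix = np.gradient(phi)
    return psi + dt * (phi * (norm * div + nu) + phix * px + phiy * py)
```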
Remark. In [35], we consider a modified area functional of the form

A_\phi(t) = -\frac{1}{2}\int_0^{L(t)} \phi\,\langle C, \mathcal{N}\rangle\,ds = -\frac{1}{2}\int_0^1 \phi\,\left\langle C, \begin{pmatrix} -y_p \\ x_p \end{pmatrix}\right\rangle dp,    (2.7)

which leads to the gradient flow

C_t = \left\{\phi + \frac{1}{2}\langle C, \nabla\phi\rangle\right\}\mathcal{N}.    (2.8)

A hybrid snake model combining the length and area minimizing flows is also proposed. Notice that since the area flow only involves first order terms, it may converge more quickly to the desired edge than a length minimizing one.
3. Optical Flow
Optical flow has proved to be an important tool for problems arising in active vision. In this section, we discuss an approach from [17] for reliably computing optical flow. The optical flow field is the velocity vector field of apparent motion of brightness patterns in a sequence of images [12]. One assumes that the motion of the brightness patterns is the result of relative motion, large enough to register a change in the spatial distribution of intensities on the images. Thus, relative motion between an object and a camera can give rise to optical flow. Similarly, relative motion among objects in a scene being imaged by a static camera can give rise to optical flow. In [17], we consider a spatiotemporal differentiation method for optical flow. Even though in such an approach the optical flow typically estimates only the isobrightness contours, it has been observed that if the motion gives rise to sufficiently large intensity gradients in the images, then the optical flow field can be used as an approximation to the real velocity field and the computed optical flow can be used reliably in the solutions of a large number of problems; see [11] and the references therein. Thus, optical flow computations have been used quite successfully in problems of three-dimensional object reconstruction, and in three-dimensional scene analysis for computing information such as depth and surface orientation. In object tracking and robot navigation, optical flow has been used to track targets of interest. Discontinuities in optical flow have proved an important tool in approaching the problem of image segmentation.
Visual Tracking, Active Vision, and Gradient Flows
189
"transpose.") This is the celebrated aperture problem. One way of treating the aperture problem is through the use of regularization in computation of optical flow, and consequently the choice of an appropriate constraint. A natural choice for such a constraint is the imposition of some measure of consistency on the flow vectors situated close to one another on the image. Horn and Schunk [12] use a quadratic smoothness constraint. The immediate difficulty with this method is that at the object boundaries, where it is natural to expect discontinuities in the flow, such a smoothness constraint will have difficulty capturing the optical flow. For instance, in the case of a quadratic constraint in the form of the square of the norm of the gradient of the optical flow field [12], the Euler-Lagrange (partial) differential equations for the velocity components turn out to be linear elliptic. The corresponding parabolic equations therefore have a linear diffusive nature, and tend to blur the edges of a given image. In the past, work has been done to try to suppress such a constraint in directions orthogonal to the occluding boundaries in an effort to capture discontinuities in image intensities that arise on the edges; see [21] and the references therein. We have proposed in [17] a novel method for computing optical flow based on the theory of the evolution of curves and surfaces. The approach employs an L 1 type minimization of the norm of the gradient of the optical flow vector rather than quadratic minimization as has been undertaken in most previous regularization approaches. This type of approach has already been applied to derive a powerful denoising algorithm [28]. The equations that arise are nonlinear degenerate parabolic equations. The equations diffuse in a direction orthogonal to the intensity gradients, i.e., in a direction along the edges. This results in the edges being preserved. The equations can be solved by following a methodology very similar to the evolution of curves based on the work in [24, 33]. Proper numerical implementation of the equations leads to solutions which incorporate the nature of the discontinuities in image intensities into the optical flow. A high level algorithm for our m e t h o d may be described as follows: 1. Let E = E(x, y, t) be the intensity of the given moving image. Assume constancy of intensity at any point in the image, i.e.,
E~u + Eyv + Et = O, where
dx
dy
are the components of the apparent motion of brightness patterns in the image which we want to estimate. 2. Consider the regularization of optical flow using the L 1 cost functional (~,~)
where r is the smoothness parameter.
190
Allen Tannenbaum and Anthony Yezzi, Jr.
3. The corresponding Euter-Lagrange equations may be computed to be
~, - a ~ E ~ ( E x u + E y v + E t ) = O, av - c~2Ex(E~u + E ~ v + E t ) = O,
where the curvature 9
~Tu
and similarly for ~v. 4. These equations are solved via "gradient descent" by introducing the system of nonlinear parabolic equations at, = ~
- a2E~(Ex~ + Ey~ + Et),
% = ~
-
a2E~(Ejz + E ~ + Et),
   for \hat{u} = \hat{u}(x, y, t_1), and similarly for \hat{v}.

The above equations have a significant advantage over the classical Horn-Schunck quadratic optimization method since they do not blur edges. Indeed, the diffusion equation

\Phi_t = \Delta\Phi - \frac{1}{\|\nabla\Phi\|^2}\langle \nabla^2\Phi(\nabla\Phi), \nabla\Phi \rangle

does not diffuse in the direction of the gradient ∇Φ; see [2]. Our optical flow equations are perturbations of the following type of equation:

\Phi_t = \frac{1}{\|\nabla\Phi\|}\left(\Delta\Phi - \frac{1}{\|\nabla\Phi\|^2}\langle \nabla^2\Phi(\nabla\Phi), \nabla\Phi \rangle\right).
Since ‖∇Φ‖ is maximal at an edge, our optical flow equations do indeed preserve the edges. Thus the L¹-norm optimization procedure allows us to retain edges in the computation of the optical flow. This approach to the estimation of motion will be one of the tools which we will employ in our tracking algorithms. The algorithm has already proven to be very reliable for various types of imagery [17]. We have also applied a similar technique to the problem of stereo disparity [18]. We also want to apply some of the techniques of Vogel [36] in order to significantly speed up our optical flow algorithms.
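A minimal sketch of the gradient-descent scheme of step 4 above is given below; E_x, E_y, E_t are assumed to be precomputed spatio-temporal derivatives of the image sequence, the grid spacing is taken as one, and the step size and α are illustrative values rather than those used in [17].

```python
# Illustrative explicit update for the L1 optical flow equations of step 4.
import numpy as np

def l1_flow_step(u, v, Ex, Ey, Et, alpha=10.0, dt=0.05, eps=1e-8):
    def curvature(f):
        fy, fx = np.gradient(f)
        norm = np.sqrt(fx**2 + fy**2) + eps
        return np.gradient(fx / norm, axis=1) + np.gradient(fy / norm, axis=0)
    residual = Ex * u + Ey * v + Et          # optical flow constraint residual
    u = u + dt * (curvature(u) - alpha**2 * Ex * residual)
    v = v + dt * (curvature(v) - alpha**2 * Ey * residual)
    return u, v
```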
4. An Example: Optical Feedback for Airborne Laser Project
In this section, we will briefly describe a project in which we are involved which employs some of the preceding ideas. The purpose of this effort is to study a benchmark control and image processing problem concerned with tracking a high speed projectile. We want to explore the unique features of the problem and develop a feasibility study as to whether current day technology/algorithms are capable of addressing the program. The initial phase of the problem focuses on images generated with wave optics simulation of missiles for the airborne laser (ABL) project. The data is typically in gray-level format and represents a sequence of images of the target. Atmospheric effects and diffraction are assumed to be the main sources of noise. Background as well as detector noise will be added in subsequent phases. The images are provided at 100 μsec intervals and are 200 pixels square. The objective of the tracker is to accurately locate the nose of the missile and to maintain a line of sight reference to it at about 1 kHz bandwidth.

A brief outline of our approach is as follows. A 4-degree of freedom rectangular frame (2-degrees for position, 1-degree for orientation, plus 1-degree for size/distance to the target) is placed around the distorted image of the missile to be used for tracking. Due to atmospheric disturbance the image of the missile appears to be very noisy. Using our nonlinear denoising and enhancement algorithms [14, 25, 30], we produce a filtered image. We then apply the geometric active contours ("snakes") algorithm described above in order to extract a contour for tracking. The nonlinear optical-flow algorithm, based on L¹-optimization described in the previous section, is used to determine the motion of the resulting image relative to the frame, and correcting action should be taken to re-position the frame on the next image. More precisely, this algorithm is applied to alternate images re-positioned by the feedback loop. The "error" signal driving the feedback loop can be generated as follows. We compare alternate images, one corresponding to the (smoothed) input of the current time-instant and the other, corresponding to the previous point in time, suitably repositioned according to our best estimate of position and movement of the projectile. Statistical and averaging methods will be used to detect the relative motion from a needle diagram of the optical-flow, and will generate an error signal reflecting the need of suitable corrective action on the positioning of the bounding frame.
5. Conclusions

The problem of visual tracking lies at the interface between control and active vision, and is the template for the whole question of how to use visual
information in a feedback loop. To solve such problems the new area of controlled active vision has been created in the past few years. Historically, control theory and computer vision have emerged as two separate research disciplines employing techniques from systems, computer science, and applied mathematics. Recently, new challenges have appeared in robotics, advanced weapons systems, intelligent highways, and other key technologies, that signal the need to bring these disciplines together. This is the purpose of the new research area of controlled active vision which represents the natural unification of vision and feedback. Vision is the key sensory modality for mediating interaction with the physical world, and control has developed powerful methodologies about feedback and the reduction of uncertainty, spurred on by the pioneering work of George Zames. We have reached the exciting time when much of the most interesting work in both fields will consist of this emerging control/vision synthesis.

Acknowledgement. This work was supported in part by grants from the National Science Foundation ECS-99700588, NSF-LIS, by the Air Force Office of Scientific Research AF/F49620-94-1-00S8DEF, AF/F49620-94-1-0461, AF/F49620-98-1-0168, by the Army Research Office DAAL03-92-G-0115, DAAH04-94-G-0054, DAAH04-93-G-0332, and MURI Grant.
References

1. L. Alvarez, F. Guichard, P. L. Lions, and J. M. Morel, "Axioms and fundamental equations of image processing," Arch. Rational Mechanics 123 (1993), pp. 200-257.
2. L. Alvarez, P. L. Lions, and J. M. Morel, "Image selective smoothing and edge detection by nonlinear diffusion," SIAM J. Numer. Anal. 29 (1992), pp. 845-866.
3. J. L. Barron, D. J. Fleet, and S. S. Beauchemin, "Performance of optical flow techniques," International Journal of Computer Vision, 12:43-77, 1994.
4. A. D. Bimbo, P. Nesi, and J. L. C. Sanz, "Optical flow computation using extended constraints," Technical report, Dept. of Systems and Informatics, University of Florence, 1992.
5. A. Blake, R. Curwen, and A. Zisserman, "A framework for spatio-temporal control in the tracking of visual contours," to appear in Int. J. Computer Vision.
6. V. Caselles, F. Catte, T. Coll, and F. Dibos, "A geometric model for active contours in image processing," Numerische Mathematik 66 (1993), pp. 1-31.
7. V. Caselles, R. Kimmel, and G. Sapiro, "Geodesic snakes," to appear in Int. J. Computer Vision.
8. C. Foias, H. Ozbay, and A. Tannenbaum, Robust Control of Distributed Parameter Systems, Lecture Notes in Control and Information Sciences 209, Springer-Verlag, New York, 1995.
9. M. Grayson, "The heat equation shrinks embedded plane curves to round points," J. Differential Geometry 26 (1987), pp. 285-314.
10. E. C. Hildreth, "Computations underlying the measurement of visual motion," Artificial Intelligence, 23:309-354, 1984.
11. B. K. P. Horn, Robot Vision, MIT Press, Cambridge, Mass., 1986.
12. B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artificial Intelligence, 23:185-203, 1981.
13. S. Kichenassamy, A. Kumar, P. Olver, A. Tannenbaum, and A. Yezzi, "Conformal curvature flows: from phase transitions to active vision," Archive of Rational Mechanics and Analysis 134 (1996), pp. 275-301.
14. B. B. Kimia, A. Tannenbaum, and S. W. Zucker, "Shapes, shocks, and deformations, I," Int. J. Computer Vision 15 (1995), pp. 189-224.
15. B. B. Kimia, A. Tannenbaum, and S. W. Zucker, "On the evolution of curves via a function of curvature, I: the classical case," J. of Math. Analysis and Applications 163 (1992), pp. 438-458.
16. B. B. Kimia, A. Tannenbaum, and S. W. Zucker, "Optimal control methods in computer vision and image processing," in Geometry Driven Diffusion in Computer Vision, edited by Bart ter Haar Romeny, Kluwer, 1994.
17. A. Kumar, A. Tannenbaum, and G. Balas, "Optical flow: a curve evolution approach," IEEE Transactions on Image Processing 5 (1996), pp. 598-611.
18. A. Kumar, S. Haker, C. Vogel, A. Tannenbaum, and S. Zucker, "Stereo disparity and L¹ minimization," to appear in Proceedings of CDC, December 1997.
19. R. Malladi, J. Sethian, and B. Vemuri, "Shape modelling with front propagation: a level set approach," IEEE PAMI 17 (1995), pp. 158-175.
20. D. Mumford and J. Shah, "Optimal approximations by piecewise smooth functions and associated variational problems," Comm. on Pure and Applied Math. 42 (1989).
21. H.-H. Nagel and W. Enkelmann, "An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences," IEEE Trans. Pattern Analysis and Machine Intelligence PAMI-8 (1986), pp. 565-593.
22. B. K. P. Horn, Robot Vision, MIT Press, Cambridge, Mass., 1986.
23. L. I. Rudin, S. Osher, and E. Fatemi, "Nonlinear total variation based noise removal algorithms," Physica D 60, pp. 259-268, 1992.
24. S. J. Osher and J. A. Sethian, "Fronts propagating with curvature dependent speed: Algorithms based on Hamilton-Jacobi formulations," Journal of Computational Physics 79 (1988), pp. 12-49.
25. P. Perona and J. Malik, "Scale-space and edge detection using anisotropic diffusion," IEEE Trans. Pattern Anal. Machine Intell. 12 (1990), pp. 629-639.
26. T. Poggio, V. Torre, and C. Koch, "Computational vision and regularization theory," Nature 317 (1985), pp. 314-319.
27. B. ter Haar Romeny (editor), Geometry-Driven Diffusion in Computer Vision, Kluwer, Holland, 1994.
28. L. I. Rudin, S. Osher, and E. Fatemi, "Nonlinear total variation based noise removal algorithms," Physica D 60 (1993), 259-268.
29. G. Sapiro and A. Tannenbaum, "On affine plane curve evolution," Journal of Functional Analysis 119 (1994), pp. 79-120.
30. G. Sapiro and A. Tannenbaum, "Invariant curve evolution and image analysis," Indiana University J. of Mathematics 42 (1993), pp. 985-1009.
31. B. G. Schunck, "The motion constraints equation for optical flow," Proceedings of the Seventh IEEE International Conference on Pattern Recognition, pages 20-22, 1984.
32. J. A. Sethian, "Curvature and the evolution of fronts," Commun. Math. Phys. 101 (1985), pp. 487-499.
33. J. A. Sethian, "A review of recent numerical algorithms for hypersurfaces moving with curvature dependent speed," J. Differential Geometry 31 (1989), pp. 131-161.
34. J. Shah, "Recovery of shapes by evolution of zero-crossings," Technical Report, Math. Dept., Northeastern Univ., Boston MA, 1995.
35. K. Siddiqi, Y. Lauziere, A. Tannenbaum, and S. Zucker, "Area and length minimizing flows for segmentation," to appear in IEEE Transactions on Image Processing, 1998.
36. C. Vogel, "Total variation regularization for ill-posed problems," Technical Report, Department of Mathematics, Montana State University, April 1993.
37. A. Yezzi, S. Kichenassamy, A. Kumar, P. Olver, and A. Tannenbaum, "Geometric active contours for segmentation of medical imagery," IEEE Trans. Medical Imaging 16 (1997), pp. 199-209.
38. A. Yezzi, "Modified mean curvature motion for image smoothing and enhancement," to appear in IEEE Trans. Image Processing, 1998.
Visual Control of Grasping

Billibon H. Yoshimi and Peter K. Allen
Department of Computer Science, Columbia University, New York, NY 10027 USA
Summary. Most robotic hands are either sensorless or lack the ability to accurately and robustly report position and force information relating to contact. This paper describes a system that integrates real-time computer vision with a sensorless gripper to provide closed loop feedback control for grasping and manipulation tasks. Many hand-eye coordination skills can be thought of as sensory-control loops, where specialized reasoning has been embodied as a feedback or control path in the loop's construction. This system captures the essence of these hand-eye coordination skills in simple visual control primitives, which can be used to perform higher-level grasping and manipulation tasks. Experimental results are shown for two typical robotics tasks: the positioning task of locating, picking up, and inserting a bolt into a nut under visual control and the visual control of a bolt tightening task.
1. Introduction

As machine vision becomes faster and cheaper, adding visual control to a robotics task becomes feasible. This paper describes the use of visual feedback to assist the grasping task. Grasping is a difficult problem that encompasses many degrees-of-freedom, arbitrary geometries of parts to be grasped, physical issues such as stability and stable contacts over time, and real-time performance issues. While there have been a number of detailed analyses of the kinematic and dynamic constraints necessary to effect stable grasps, most require a high level of sensory input and feedback from the grasping device (i.e. robotic hand) to perform dextrous manipulation. The sensory information required typically includes contact point estimation, surface normal and curvature measures, and knowledge of both applied and induced forces on the fingers of the hand. While great strides have been made in robotic hand design and a number of working dextrous robotic hands built, the reality is that the sensory information required for dextrous manipulation lags the mechanical capability of the hands. Accurate and high bandwidth force and position information for a multiple finger hand is still difficult to acquire robustly. Our research is aimed at using vision to provide the compliance and robustness which assembly operations require without the need for extensive analysis, detailed knowledge of the environment or direct physical contact to control a complicated grasping and manipulation task. Using a visual sensor, we gain an understanding of the spatial arrangement of objects in the environment without disturbing the environment, and can provide a means for providing robust feedback for a robot control loop.
We motivate the application of vision to grasping with the following example taken from manufacturing and assembly tasks. In most manufacturing tasks, it is necessary to have the ability to move parts together in useful configurations so as to make the assembly process more efficient. For example, moving fingers to surround a nut or moving a grasped nut to a bolt. In the case of grasping a nut, the robot must locate the nut, move its gripper to the vicinity of the nut, locate the best grasping points on the nut, servo the fingers of the gripper to those grasping points, and finally, verify that the grasp is sufficient for holding the nut. Usually these kinds of tasks are performed by blind robots who use a priori known object positions, jigs and other devices to remove the need for the robot to recover where the objects are located. Systems built using this kind of open-loop, pre-defined knowledge control usually require large start-up costs in pre-production measurement, setup and testing. These systems also exhibit inflexible, brittle qualities. If a small design change is made in how the product is manufactured, the cost of replanning the robot assembly line, which may include extensive retooling and rejigging and a total revision of the robot control strategy, can be prohibitively expensive. The research described in this paper demonstrates how computer vision can be used to alleviate many of the problems associated with these systems. It focuses on how simple visual control primitives can be used to provide feedback essential to the grasping problem.

Other researchers have built systems that use vision to control robot motion. Hollingshurst and Cipolla [9] have developed a system for positioning a gripper above an object in the environment using an affine stereo transform to estimate the object's position. They correct for the transform's errors by using a second control scheme which converts the relative error in image space into a real world positioning change. Castano and Hutchinson [5] use visual constraint planes to create compliant surfaces for constrained robot movement in the real world. Both Hager et al. [7] and Feddema et al. [6] have used Image Jacobian-based control to perform various positioning tasks. Sharma et al. [14] use perceptual 3D surfaces to represent the workspace of the gripper and object and they plan their positioning tasks along these surfaces. Blake [4] has developed a computational model of hand-eye coordination that develops a qualitative theory for the classification of grasps that utilizes dynamic image contours. Other representative systems which have used visual control include [16, 13, 11, 15, 19, 2]. The system described in this paper is specifically aimed at merging vision with grasping by providing visual control of position and contact using sensorless robotic fingers and hands. Details can be found in [18].
2. Vision as a Feedback Mechanism for Grasping
Our goal is to visually monitor and control the fingers of a robotic hand as it performs grasping and manipulation tasks. Our motivation for this is the
general lack of accurate and fast feedback from most robotic hands. Many grippers lack sensing, particularly at the contact points with objects, and rely on open loop control to perform grasping and manipulation tasks. Vision is an inexpensive and effective method to provide the necessary feedback and monitoring for these tasks. Using a vision system, a simple uninstrumented gripper/hand can become a precision device capable of position and possibly even force control. Below, we outline some aspects of visual control that are well suited to the grasping problem:

1. Visually determining grasp points. This is a preliminary step before grasping takes place, and may not be as time critical as the manipulation task itself.
2. Vision can be very important in dealing with unstructured and/or moving environments, where model-based knowledge may be unavailable or errorful. This is an example of the active vision paradigm.
3. Once a grasp has been effected, vision can monitor the grasp for stability. By perturbing the fingers, we can measure the amount of displacement and types of displacement in image space of the object. If the object does not move correctly, we can say that the grasp is faulty.
4. Visually monitoring a task provides feedback necessary both to perform the task and to gauge how well the task was performed, or if an error has occurred.

While visual control of grasping can be very helpful, we need to recognize some problems associated with it. The problems listed below need to be adequately addressed in order to successfully control grasping using vision, and are at the crux of why this is a difficult robotics problem.

1. Grasping and manipulation need real-time sensory feedback. Vision systems may not be able to provide the necessary analysis of the image and computation of an actuator movement fast enough.
2. In grasping with a robotic hand, multiple fingers need to be employed. This entails having the vision system follow multiple moving objects in addition to the possible movement of any object to be manipulated.
3. Grasping and manipulation require 3-D analysis of relative relationships of fingers and objects. Vision systems only provide a 2-D projection of the scene.
4. As fingers close in on an object to be manipulated, visual occlusion of both the object and fingers can easily occur.
5. Grasping tasks require knowledge of forces exerted by fingers. Vision systems can not directly compute accurate force measurements.

Visual control of grasping is not a panacea. The integration of vision and local contact/force information is needed for truly precise control of grasping and manipulation. The work described in this paper is aimed at highlighting what vision can provide. This work can be extended to model the interplay
Fig. 2.1. Left: Experimental system used to test visual control of grasping and manipulation. Right: Toshiba hand in position for the guarded-move experiment. Neither finger is touching the bolt in this picture.
This work can be extended to model the interplay of vision and touch for even more complex tasks, including the analysis of partially occluded regions of space and complicated multifingered grasps with force control [1, 3]. Many eye-hand coordination skills can be thought of as sensory-control loops, where specialized reasoning has been embodied as a feedback or control path in the loop's construction. This paper describes a framework that captures the essence of these eye-hand coordination skills in simple visual control primitives. Figure 2.1 shows the major components of our experimental system: a PUMA 560 robot, a four-fingered Toshiba FMA gripper (see section 6 for a detailed description of this device), and two static stereo cameras that view the robot's workspace. We will describe examples using a simple, sensorless multifingered gripper that can perform higher level grasping tasks using visual feedback alone.
3. Visual Control Primitive Operations
We required the vision system to track multiple objects in real time with as little delay as possible. In our system, each moving target is a fiducial mark (a black dot on a white background) which can be attached to a robot or other objects in the environment. The upper bound of the number of targets trackable at any one time is 255. Each finger of the robotic hand has 4 fiducial marks, and each object to be manipulated also has 1 or 2 fiducial marks. The tracker uses intensity thresholds to segment the fiducial marks from the background. To obtain the 3D position of objects to be tracked we use stereo correspondence from two calibrated cameras. Once a feature has been identified in each camera, back-projection is used to determine its 3D position in space.
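The per-frame core of this tracker — thresholding a small search window around each mark's previous position and taking the centroid of the segmented blob — can be sketched as follows. This is an illustrative NumPy version, not the PIPE implementation described below; the window size and intensity threshold are assumed values.

```python
import numpy as np

def track_fiducial(gray, prev_xy, win=15, dark_thresh=60):
    """Re-locate one dark-on-light fiducial mark near its previous position.

    gray        -- 2-D uint8 image
    prev_xy     -- (x, y) centroid from the previous frame
    win         -- half-width of the search window in pixels (illustrative value)
    dark_thresh -- intensity below which a pixel is considered part of the mark
    Returns the new (x, y) centroid, or None if the mark was lost.
    """
    x0, y0 = int(round(prev_xy[0])), int(round(prev_xy[1]))
    ys = slice(max(y0 - win, 0), min(y0 + win + 1, gray.shape[0]))
    xs = slice(max(x0 - win, 0), min(x0 + win + 1, gray.shape[1]))
    window = gray[ys, xs]

    mask = window < dark_thresh          # intensity segmentation of the dark dot
    if not mask.any():
        return None                      # mark lost (e.g. occluded)

    rows, cols = np.nonzero(mask)
    cy = rows.mean() + ys.start          # centroid in full-image coordinates
    cx = cols.mean() + xs.start
    return (cx, cy)
```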
The accuracy of this procedure is highly dependent on the accuracy of the camera calibrations. In our system, we perform a simple least squares calibration on each camera. Correspondence between features in each camera is determined initially by the user, and the tracking maintains these correspondences at each sample interval. The current version of the algorithm allowed us to perform stereo tracking of 12 features at approximately 15 Hz with an inherent processing delay of 116.7 ms. The tracking algorithm below has been implemented on a PIPE multi-stage, pipelined parallel vision processor [10].
1. A camera images the scene and the image is sent to the processing stages in the PIPE image processor.
2. The operator indicates to the vision algorithm the position of the fiducial marks corresponding to the gripper, bolt and nut in both tracking images. This process could have been automated with an automatic feature extractor, but the experiments described here used manual seeding. Each fiducial mark in each image is given a unique number to facilitate tracking, and is updated during subsequent iterations of the algorithm.
3. Each fiducial mark region in the tracking image is morphologically expanded by 2 pixels, which serves as a search mask for the next frame. The entire process takes 2 frame cycles, or 1/30th of a second, to perform.
4. The algorithm uses the expanded regions in the tracking image as a template for finding fiducial marks in the intensity image. The fiducial mark intensity constraints associated with each region in the tracking image are applied to the new intensity image. If a position satisfies the intensity constraint for a region, it is added to that region; otherwise it is deleted. The resulting tracking image contains regions which reflect the new observed positions of the fiducial marks. This step takes 1/60th of a second.
5. The tracking image is also passed through a histogram/accumulator processor. The PIPE accumulates statistics necessary for determining the centroid of each uniquely numbered region in the image. This process also takes 1/60th of a second, and is performed in parallel for both horizontal and vertical camera coordinates and for both cameras.
The total delay in finding the centroid positions for a set of fiducial marks in both cameras is 116.7 ms.
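Given the 3x4 projection matrices produced by the per-camera least squares calibration, the back-projection of a matched feature pair into 3D can be written as a standard linear triangulation. The sketch below is our illustration of that step, not the system's code; the matrix and argument names are assumptions.

```python
import numpy as np

def triangulate(P_left, P_right, uv_left, uv_right):
    """Least-squares 3-D point from one feature seen in two calibrated cameras.

    P_left, P_right   -- 3x4 camera projection matrices from calibration
    uv_left, uv_right -- (u, v) pixel coordinates of the matched fiducial mark
    Returns the 3-D point in the world frame W.
    """
    def two_rows(P, uv):
        u, v = uv
        # Each view contributes two linear constraints on the homogeneous point X:
        #   u * (P[2] @ X) = P[0] @ X   and   v * (P[2] @ X) = P[1] @ X
        return np.vstack([u * P[2] - P[0], v * P[2] - P[1]])

    A = np.vstack([two_rows(P_left, uv_left), two_rows(P_right, uv_right)])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                 # null-space direction = homogeneous 3-D point
    return X[:3] / X[3]
```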
4. Control Algorithm
The vision system is capable of delivering new visual information at a 15 Hz rate. The robot system, though, requires updates at kHz rates. This discrepancy between the robot controller's minimum update rate and the vision system's capability of producing data required that we devise a scheme which allowed each system to operate without unduly restricting the other. The
solution we derived takes the form of a two-level controller. In the low-level robot control part of the system, 3D positions in the real world, W, are passed to a low-level trajectory generator which generates the intermediate positions the robot must move through between its current position and a desired goal position. This part of the system must operate at precisely specified intervals; in this case, the system controller required a new position update every 0.83 ms, otherwise the robot crashes. In the high-level control part of the system, the vision system observes the current position of the robot and other objects in the environment and updates its internal representation of the world. This representation is then used by the high-level control process either to correct the current goal position used by the low-level trajectory generator or to halt itself if the operation has met its halting criterion. By separating the controller into two halves, the controller update problem is no longer constrained by the video processing delay. Hence, even if visual data were not available during a low-level control cycle, the trajectory generator could still use the available goal position to derive a new intermediate position.
The basic problem in visual control is to relate an image space signal to a corresponding control movement. In our case, the robot's position in the real world, W, is known and controlled by us. What is unknown is the relationship between the position error observed by the vision system and the corresponding control movement in W. In our system, we compute the error in position between a known point on the robot, P1, and a goal point in the workspace, P2, as

E = P2′ − P1′     (4.1)

where P1′ and P2′ are the back-projected locations of the two points which have been imaged by the stereo cameras. Hence, we are defining a relative error computed from the vision system that can be used to position the robot. Since we are dealing with point features only, we restrict our analysis to translational degrees of freedom; we assume the gripper is in a fixed pose. Given this error vector, which is expressed in 3D space, we can command the robot to the new position that will zero the error. However, the new position error will only be zero if the calibration is exact, which in our case it may not be. To alleviate this, we continually update this position error from the vision system. As we move the robot in the direction of the error vector, we update the relative positions of the points P1 and P2 using the vision system. Small errors in positioning can therefore be compensated for in an online fashion. At each step of the process we are recomputing and reducing the relative position error. As this is a simple position controller, it can suffer from the usual problems of oscillation and/or infinitely slow convergence. In our case, we were able to control the position accurately enough using a variable position gain and a cutoff threshold to prevent oscillation. Other classical control techniques employ derivative and/or integral terms to fix these problems. The controller finishes when E falls below some critical threshold. Using
this technique, we were able to insert a peg with a 5mm diameter tip into a 9mm diameter hole with no problems. As the calibration degrades, so does the algorithm's performance. While our method is not guaranteed to converge, results from Maybank and Faugeras [12] and Hager et al. [8] indicate that convergence can be expected if bounds can be placed on the errors in the calibration matrix. The system is also affected by stereo triangulation errors due to object movement and processing delays. Errors introduced by sampling the left and right image streams at different points in time can be interpreted as a change in a tracked object's depth, and tracked points can often be seen "oscillating" about the true tracked path. This problem can be alleviated by using a velocity threshold for the robot (in our experiments, about 30% of the robot's maximum velocity).
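The high-level half of this controller is essentially a proportional law on the vision-derived relative error E = P2′ − P1′ with a variable gain and a cutoff threshold. A minimal sketch is given below; observe_points and send_goal are hypothetical stand-ins for the vision interface and the low-level trajectory generator, and the gains and thresholds are illustrative rather than the values used in the experiments.

```python
import numpy as np

def visual_position_loop(observe_points, send_goal, stop_thresh_mm=1.0,
                         coarse_gain=0.3, fine_gain=0.05, fine_radius_mm=50.0):
    """Proportional visual positioning on the relative error E = P2' - P1'.

    observe_points -- returns (P1, P2): back-projected 3-D positions of the
                      controlled point and the goal point (hypothetical interface)
    send_goal      -- hands a new 3-D goal to the low-level trajectory generator
    """
    while True:
        P1, P2 = observe_points()                 # latest stereo estimates
        E = np.asarray(P2) - np.asarray(P1)       # relative position error in W
        dist = np.linalg.norm(E)
        if dist < stop_thresh_mm:                 # cutoff threshold: done
            return
        # coarse/fine gain scheduling reduces oscillation near the goal
        gain = coarse_gain if dist > fine_radius_mm else fine_gain
        send_goal(np.asarray(P1) + gain * E)      # step toward the goal
```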
5. Implemented Primitives
The Move-to-3D-position primitive generates movement commands causing the object being tracked to move to the position specified. The position is a 3D point specified in real world coordinates. The object being tracked is a known 3D point which the robot can control, such as the wrist of the robot, a finger on the manipulator, or an object being manipulated. The Guarded-Move primitive generates movement commands causing the object being tracked to move to the target position specified without contact. If any contact occurs, the object being controlled is stopped from moving, and an exception is raised upon detecting the guard condition. Contact is determined by the vision system alone. These two primitives can be thought of as components of a coarse/fine positioning strategy. While the Move-to-3D-position primitive is useful for quickly moving a manipulator in an environment, its utility is strictly confined to non-contact maneuvers. If more careful motion is desired, the Guarded-Move primitive operation can be used to cross over from non-contact to contact based visual control. The Guarded-Move primitive assumes that the 3D coordinates of F, a point on the manipulator (e.g. a fiducial mark on a finger), and O, some point in the environment (e.g. a feature on an object to be manipulated), are known. As we move along the trajectory from F into contact with O, if the observed position of F does not change, we can assume contact has occurred. This relationship can be generalized for a set of fingers, Fi, and a set of points associated with an object, Oi. Using this formulation, we can determine when a group of fingers has established contact with a given object using visual control. Our implementation detects when contact has been established in the following manner. A finger is commanded to close in upon a designated 3D point on an object. The controller monitors the movement of each finger feature, Fi. As each finger makes contact with the object, its corresponding feature point stops moving. We attribute the lack of movement to the object
exerting a force against the finger. The halting condition is defined using the following algorithm. At any time t, there is a 3D position associated with our controlled point, called P_t = (x_t, y_t, z_t). For all times t > n, we compute the variance taken over the x, y, z components of the n most recent manipulator positions. If this value falls below a small threshold, ε, we order the algorithm to terminate. This termination condition can be written more explicitly as:
sqrt( σ[x_t, ..., x_{t−n}]² + σ[y_t, ..., y_{t−n}]² + σ[z_t, ..., z_{t−n}]² ) < ε     (5.1)
In the experiments described below we use the values n = 5 and ε = 1 mm².
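The termination test of equation (5.1) can be evaluated from a short history of tracked 3D positions. The following is a minimal sketch, assuming positions arrive in millimeters; the class and method names are ours, and the defaults follow the values quoted above.

```python
import numpy as np
from collections import deque

class GuardedMoveMonitor:
    """Detects contact as the point where a tracked finger feature stops moving
    (equation 5.1): sqrt(var_x + var_y + var_z) over the last n samples < eps."""

    def __init__(self, n=5, eps=1.0):
        self.n = n
        self.eps = eps
        self.history = deque(maxlen=n)   # last n observed 3-D positions (mm)

    def contact(self, position):
        """Feed one new 3-D position; returns True once the guard condition holds."""
        self.history.append(np.asarray(position, dtype=float))
        if len(self.history) < self.n:
            return False
        var_xyz = np.var(np.stack(self.history), axis=0)   # per-axis variance
        return float(np.sqrt(var_xyz.sum())) < self.eps
```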
Fig. 5.1. 3D x, y, and z positions of the fiducial mark as the finger approached the center of the bolt. The horizontal axis is time coded in control cycles. The vertical axis encodes the real world position of the fiducial mark in mm. Note: the scales on the 3 vertical axes are different: 2mm on x, 10mm on y and 5mm on z.
Figure 2.1 is a picture of the experimental setup used in the Guarded-Move experiment. The robot is ordered to move the finger on the left to the center of the bolt. During the experiment, two cameras mounted approximately 1 meter away from the scene with a baseline separation of 0.3 meters and 50 mm lenses (not visible in the picture) digitize the environment. Figure 5.1 shows the 3D positions of the fiducial mark for a complete experiment, in which the gripper finger goes from a fully open position to a position which satisfies the guard condition (see equation 5.1). These figures show that the finger traveled approximately 95mm from its initial starting state to its final resting state. The halting condition is satisfied by the constant x, y, z positions after time t = 37.
6. Experiment: Bolt Extraction and Insertion
The first experiment was to have a robot grasp a bolt, extract it, and insert it into a nut. This experiment treated the gripper as an open loop device, but controlled the hand position using visual feedback. The gripper we used is a Toshiba FMA gripper (see figure 2.1). The gripper comprises 4 fingers, each made up of 3 stretchable, flexible chambers [17]. The gripper can be positioned in a variety of different poses by changing the pressure in each of the finger chambers. The original design used binary pressure
valves which limited the movement of each finger to only eight possible configurations. There are three problems with the gripper which make it difficult to use. First, the gripper system is totally devoid of sensors. This is a common problem shared by many other grippers. The variation in a finger's dimensions as it changes positions makes it very difficult to attach sensors to the fingers. Second, the control space of the fingers is non-linear. Changes in pressure to a chamber do not necessarily correspond to an equivalent linear change in finger position. In a later section, we will describe how we used visual control to compensate for this problem. We increased the resolution of the finger's workspace by adding continuous pneumatic servo valves. Each valve took an input voltage and converted it into a pressure output. By changing the voltage, we could change the pressure inside each finger chamber. These voltages were generated by a D/A board attached to the computer. We also simplified the control problem by constraining the movement of each finger to lie in a plane. Since driving all three chambers causes the finger to extend, confining all legal finger configurations to those in which pressure is applied to a maximum of two chambers inside each finger at any one time simplifies the control problem. The constraints on the problem guarantee that at most 2 vectors are used to uniquely describe any position in the 2D gripper workspace. We calibrated each finger with the environment by ordering the finger to move to a number of positions defined in its 2D gripper workspace. The fiducial mark associated with the tip of the finger was tracked in 3D using a calibrated stereo vision system. A plane was fit to these points and the transformation from 2D gripper coordinates to this plane was recovered using a least squares technique (a sketch of the plane-fitting step appears at the end of this section).
The testbed for these robot positioning tasks is shown in figure 2.1. The goal of the robot system is to perform the robotic task of approaching, grasping and extracting a bolt from a fixture, approaching a nut, and finally, inserting the bolt into the nut. We can describe this task as ALIGN-1 - the robot is first commanded to move to a point 100mm directly above the bolt; APPROACH - the robot is then moved to a point where the gripper surrounds the bolt head; GRASP - the gripper is then closed around the bolt head; EXTRACT - the bolt is extracted to a point 100mm above its fixture; ALIGN-2 - we then track the tip of the bolt, which contains a fiducial mark, and move it to a point 100mm directly above the nut; INSERT - finally, the bolt tip is positioned coincident with the nut and inserted. The gripper is initially approximately 350mm behind and 115mm above the nut. Figure 6.1 shows the robot during the various phases of the task. In each case, the robot was able to successfully position itself using visual feedback to perform the task. Figure 6.2 is an overhead projection which shows the complete path taken by the robot for the complete task. The path is shown as an X-Y orthographic projection, where X-Y are parallel to the surface of the table. The robot starts at position (100, -500). The robot is ordered to align its gripper with the bolt on the left.
Fig. 6.1. Top Left: ALIGN-1 task: image of workspace after the hand was commanded to move 100mm above the fiducial mark on the bolt. Bottom Left: APPROACH task: image of workspace after the gripper is moved to surround the bolt head. Top Right: ALIGN-2 task: image of workspace after the grasped bolt is moved 100mm above the nut. Bottom Right: INSERT task: the bolt is inserted into the nut. The fiducial mark on the nut is on the inside of the threads, and is not visible from this angle.
The overshoot displayed in the lower left of the figure is typical of a proportional control scheme. The robot then approaches, grasps and extracts the bolt. The robot aligns the gripped bolt with the nut. Once the bolt is aligned, the robot inserts the bolt into the nut. The final position of the robot is (-250, -560). All movements were carried out with three degrees of translational freedom and no rotations. The INSERT part of the task is to insert the tip of the bolt into the nut using the positioning primitive described in the previous section. In order to accomplish this precision mating task, we implemented coarse-fine control. When the distance between the bolt tip and the target was greater than 50mm, the robot was commanded to move at 30 percent of its maximum velocity. If the distance was smaller than 50mm, this velocity was decreased to 5 percent of the maximum velocity. Utilizing this controller, we were able to insert the bolt with a 5mm diameter tip into a 9mm diameter hole. To assess the effect of calibration errors, we moved the right camera approximately 400mm from its calibrated position. The robot was then instructed to perform the same bolt extraction and insertion task.
Fig. 6.2. Left: (x, y) trajectory taken by the robot in W for the entire motion. Right: (x, y) trajectory taken by the robot after the right camera has been moved approximately 400mm from its calibrated position.
Figure 6.2 also shows the trajectory taken by the robot while performing the align-approach-grasp-extract-align-insert task with the degraded calibration. The effects of calibration are noticeable. The trajectory taken by the robot in performing the alignment task is not straight. The curved path demonstrates how relative positioning can be used to correct for calibration error. In both approach tasks, the robot position oscillates noticeably near the end of the movement. In spite of these calibration errors, the system was still able to function correctly, albeit sub-optimally. The accuracy of our experimental system depends on many factors. We rely on the camera's ability to pinpoint the location of objects in the environment and on the ability of the camera calibration to determine the 3D position of those objects. Noise in the image formation process can cause an object which appears in one location to change position in adjacent sampling periods. When passed through the 2D-3D image reconstruction process, this error is magnified. We have tried to limit the effect of these problems by reducing the size of the fiducial marks (decreasing the possible chances for error in spatial localization) and decreasing the speed of the robot (decreasing the error caused by temporal aliasing).
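The finger calibration described in this section fits a plane to fingertip positions tracked in 3D before recovering the 2D gripper-to-plane transformation. A least squares version of the plane-fitting step might look as follows; this is a sketch under the assumption that the tracked positions are available as an N×3 array, and it omits the subsequent transform recovery.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through a set of 3-D fingertip positions.

    points -- (N, 3) array of positions tracked by the stereo system
    Returns (centroid, normal): a point on the plane and its unit normal.
    """
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # The plane normal is the direction of least variance of the centred data,
    # i.e. the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(pts - centroid)
    normal = Vt[-1]
    return centroid, normal / np.linalg.norm(normal)
```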
7. Experiment: Two-Fingered Bolt Tightening
The bolt tightening task requires that the fingers first grasp the object and then move synchronously when performing the task, maintaining contact. Also, the bolt tightening task requires that the system compute the new intermediate configurations between the start and goal states of the rotated object. We have decomposed the task into four individual subtasks. As shown in figure 6.3, the four subtasks are move to pretighten position, grasp bolt, tighten bolt, and return to home state. We have devised a visual control strategy which uses this finite state diagram to control the task as follows:
1. move to pretighten position - Here we use two Move-to-3D-position operations: one brings the left finger as far in front of the bolt as possible (in the negative x direction) and the other moves the right finger as far behind the bolt as possible (in the positive y direction).
2. grasp bolt head - The goal for both fingers is to make a Guarded-Move to the center of the bolt head. We use a predetermined set point 15mm behind the fiducial mark to mark the center of the bolt head.
3. tighten bolt - Tightening the bolt is accomplished by moving both fingers, simultaneously, in opposite directions. The goal positions for both fingers are 3D positions either 10mm in front of or 10mm behind their current positions.
4. return to home state - Given the initial state of the robot gripper, each finger of the gripper is ordered to the 3D position from which it started using the Move-to-3D-position primitive. For our experiments, we inserted another intermediate return home position called the post-tighten position. Ideally, the post-tighten position takes the finger directly away from the bolt, avoiding the problems caused by a glancing or dragging contact. The post-tighten position is defined to be 20mm behind the home position of the finger in the case of the left finger, or 20mm in front of the home position of the finger for the right.
Fig. 6.3. Left: Finite state representation of the bolt tightening tasks. The two smaller circles represent fingers. The large circle with the rectangle through the center represents the bolt head. Each graphic is an overhead view of the movement and position of the bolt and fingers during one part of the bolt tightening process. Right: Combined x-y plot for the grasping and tightening experiment, including the position of a fiducial mark on the bolt head (overhead view).
Figure 7.1 shows the positions of the gripper system during the experiment. The fiducial marks used for feedback in this experiment are the marks closest to the tips of each finger (mounted on an extended structure attached to each finger tip to prevent minor finger occlusions). While not obvious from the static pictures, the fingers actually traveled in an arc around the rotational axis of the bolt.
Fig. 7.1. a) Position of the fingers at the start of the bolt tightening experiment. b) Position of the fingers after grasping the bolt. c) Position of the fingers after finishing the tightening operation. d) End position of the fingers after returning to the home state.
Figure 6.3 shows the positions to which the fingers moved and the progress of the bolt head, which was also under visual control. It is an overhead view showing the x and y positions of the points in W, ignoring the z-coordinate values. Since the torques exerted by both fingers were sufficient to overcome the bolt's stiction and jamming forces, it was possible for the gripper to turn the bolt. If, on the other hand, the movement vector generated by the guarded move had not provided sufficient torque, the bolt would not have moved. By observing the movement of the bolt head, we can verify that it actually moved during the operation. Without this information it would be difficult to determine whether the bolt had jammed during the task. Recognizing this can alert a higher level process to take an alternative strategy for the current operation. We have not implemented such feedback but note that such a strategy is possible.
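The four-subtask decomposition above is, in effect, a short script over the two primitives of section 5. The sketch below illustrates that sequencing; move_to_3d and guarded_move are hypothetical stand-ins for the implemented primitives, the approach axes and offsets are illustrative only, and the two fingers — commanded simultaneously in the real system — are sequenced here for brevity.

```python
import numpy as np

def tighten_bolt_once(move_to_3d, guarded_move, left, right, bolt_center,
                      grasp_offset=15.0, turn_offset=10.0, release_offset=20.0):
    """One pretighten-grasp-tighten-return cycle built from the two primitives.

    move_to_3d(finger, goal)   -- free-space positioning primitive (hypothetical API)
    guarded_move(finger, goal) -- positioning primitive that stops on visually
                                  detected contact (hypothetical API)
    left, right -- handles identifying the two tracked finger features
    bolt_center -- 3-D position derived from the bolt-head fiducial mark
    Offsets are in mm; the approach axes chosen here are illustrative only.
    """
    c = np.asarray(bolt_center, dtype=float)
    # 1. pretighten: bring one finger in front of and one behind the bolt head
    move_to_3d(left,  c + np.array([-grasp_offset, 0.0, 0.0]))
    move_to_3d(right, c + np.array([0.0, grasp_offset, 0.0]))
    # 2. grasp: guarded moves toward the bolt-head center until contact
    guarded_move(left,  c)
    guarded_move(right, c)
    # 3. tighten: move the fingers in opposite directions about the bolt axis
    guarded_move(left,  c + np.array([0.0,  turn_offset, 0.0]))
    guarded_move(right, c + np.array([0.0, -turn_offset, 0.0]))
    # 4. return home via post-tighten positions that back away from the bolt
    move_to_3d(left,  c + np.array([-release_offset, 0.0, 0.0]))
    move_to_3d(right, c + np.array([ release_offset, 0.0, 0.0]))
```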
8. Conclusions
This system is intended to motivate the idea of adding simple and inexpensive vision systems to existing robots and grippers to provide the necessary feedback for complex tasks. In particular, we have shown how vision can be used for control of positioning and grasping tasks in typical manufacturing and assembly environments. Visual primitives have been described and implemented that can be used to construct higher level grasping and manipulation tasks. By decomposing a complex manipulation into a series of these operations, we removed much of the complexity associated with creating a visual control system. Some aspects of the system can obviously be improved upon. For example, the system uses a simple proportional control scheme which could be augmented with a more complex and perhaps faster controller. Manually seeded fiducial marks were used as visual features to facilitate real-time control, where perhaps a somewhat slower object recognition algorithm could have been used. Occlusion problems were not experienced in this experimental setup, although they could easily crop up in many environments.
Despite these shortcomings, the idea of visual control for sensorless grasping devices remains valid. It is relatively easy to designate features on a robotic hand, and then apply the primitives described here to effect visual control. As vision hardware and software continue to improve, the utility of this approach becomes apparent. Acknowledgement. This work was supported in part by DARPA contract DACA76-92-C-007, AASERT award DAAH04-93-G-0245, NSF grants CDA-90-24735, IRI-93-11877 and Toshiba Corporation.
References
1. S. Abrams. Sensor Planning in an active robot work-cell. PhD thesis, Department of Computer Science, Columbia University, January 1997.
2. P. Allen, A. Timcenko, B. Yoshimi, and P. Michelman. Automated tracking and grasping of a moving object with a robotic hand-eye system. IEEE Trans. on Robotics and Automation, 9(2):152-165, 1993.
3. P. K. Allen, A. Miller, P. Oh, and B. Leibowitz. Using tactile and visual sensing with a robotic hand. In IEEE Int. Conf. on Robotics and Automation, pages 676-681, April 22-25, 1997.
4. A. Blake. Computational modelling of hand-eye coordination. In J. Aloimonos, editor, Active Perception. Lawrence Erlbaum Associates, Inc., 1993.
5. A. Castano and S. Hutchinson. Visual compliance: Task-directed visual servo control. IEEE Trans. on Robotics and Automation, 10(3):334-342, June 1994.
6. J. Feddema and C. S. G. Lee. Adaptive image feature prediction and control for visual tracking with a hand-eye coordinated camera. IEEE Transactions on Systems, Man and Cybernetics, 20:1172-1183, Sept./Oct. 1990.
7. G. Hager, W. Chang, and A. Morse. Robot feedback control based on stereo vision: Towards calibration-free hand-eye coordination. In Proc. IEEE Conf. on Robotics and Automation, volume 4, pages 2850-2856, 1994.
8. G. D. Hager. Six DOF visual control of relative position. DCS RR-1038, Yale University, New Haven, CT, June 1994.
9. N. Hollinghurst and R. Cipolla. Uncalibrated stereo hand-eye coordination. Technical Report CUED/F-INFENG/TR126, Department of Engineering, University of Cambridge, 1993.
10. E. W. Kent, M. O. Shneier, and R. Lumia. PIPE: Pipelined image processing engine. Journal of Parallel and Distributed Computing, (2):50-78, 1985.
11. A. Koivo and N. Houshangi. Real-time vision feedback for servoing robotic manipulator with self-tuning controller. IEEE Transactions on Systems, Man, and Cybernetics, 21(1):134-142, Feb. 1991.
12. S. Maybank and O. Faugeras. A theory of self-calibration of a moving camera. International Journal of Computer Vision, 8(3):123-151, 1992.
13. N. Papanikolopoulos, B. Nelson, and P. Khosla. Six degree-of-freedom hand/eye visual tracking with uncertain parameters. In Proc. of IEEE International Conference on Robotics and Automation, pages 174-179, May 1994.
14. R. Sharma, J. Herve, and P. Cucka. Analysis of dynamic hand positioning tasks using visual feedback. Technical Report CAR-TR-574, Center for Auto. Res., University of Maryland, 1991.
15. S. Skaar, W. Brockman, and R. Hanson. Camera-space manipulation. International Journal of Robotics Research, 6(4):20-32, Winter 1987.
16. T. M. Sobh and R. Bajcsy. Autonomous observation under uncertainty. In Proc. of IEEE International Conference on Robotics and Automation, pages 1792-1798, May 1992.
17. K. Suzumori, S. Iikura, and H. Tanaka. Development of a flexible microactuator and its application to robotic mechanisms. In IEEE International Conference on Robotics and Automation, pages 1622-1627, April 1991.
18. B. Yoshimi. Visual Control of Robotics Tasks. PhD thesis, Dept. of Computer Science, Columbia University, May 1995.
19. B. H. Yoshimi and P. K. Allen. Active uncalibrated visual servoing. IEEE Transactions on Robotics and Automation, 11(5):516-521, August 1995.
Dynamic Vision Merging Control Engineering and AI Methods
Ernst D. Dickmanns
Universität der Bundeswehr München, D-85577 Neubiberg, Germany
Summary. A survey is given on two decades of development in the field of dynamic machine vision for vehicle control. The '4-D approach' developed integrates expectation-based methods from systems dynamics and control engineering with methods from AI. Dynamic vision is considered to be an animation process exploiting background knowledge about dynamical systems while analysing image sequences and inertial measurement data simultaneously; this time-oriented approach has made it possible to create vehicles with unprecedented capabilities in the technical realm: autonomous road vehicle guidance in public traffic on freeways at speeds beyond 130 km/h, on-board autonomous landing approaches of aircraft, and landmark navigation for AGVs as well as for road vehicles, including turn-offs onto cross-roads.
1. Introduction
Dynamic remote sensing for intelligent motion control in an environment with rapidly changing elements requires the use of valid spatio-temporal models for efficient handling of the large data streams involved. Other objects have to be recognized together with their relative motion components, the near ones even with high precision for collision avoidance; this has to be achieved while the vehicle body carrying the cameras moves in an intended way and is, simultaneously, subject to hardly predictable perturbations. In the sequel, this task is to be understood by the term 'dynamic vision'. For this complex scenario, inertial sensing in addition to vision is of great help; negative angular rate feedback to a viewing direction control device allows the appearance of stationary objects in the image sequence to be stabilized. Measured accelerations and velocities will, via signal integration, yield predictions for translational and rotational positions affecting the perspective mapping process. These predictions are good in the short run, but may drift slowly in the long run, especially when inexpensive inertial sensors are used. These drifts, however, can easily be compensated by visual interpretation of static scene elements. In order to better understand dynamic scenes in general, it is felt that a deeper understanding of characteristic motion sequences over some period of time will help. Since characteristic motions of objects or moves of subjects are specific to certain task domains and classes of objects or subjects, this type of dynamic vision requires spatio-temporal models for solving the
momentary interpretation problem, including recognition of control and perturbation inputs. Subjects are defined as objects with internal control actuation capabilities at their disposal [Dic 89]; this control actuation may be geared to sensory perception in combination with behavior decisions, taking background knowledge into account. By this definition, both animals and autonomous vehicles belong to the class of subjects. They may pursue goals or act according to plans (intentions) or to some local sources of information. The recognition of the intentions of other subjects gives the observing agent lead time to adjust to longer term predictions. Knowledge about the goals and plans of other subjects thus allows advantageous decisions about one's own behavior. The capability to recognize intentions early is therefore considered to be an essential part of intelligence. The 4-D approach starts from generic spatio-temporal models for objects and subjects (including stereotypical control sequences for performing maneuvers and for achieving goals) and thus allows this kind of task- and context-oriented interpretation by providing the proper framework. This is more than just the inversion of perspective projection; it is a type of animation, driven by image sequences, taking known cause-and-effect relations and temporal integrals of actions into account.
2. Simultaneous representations on differential and multiple integral scales
Combined use of inertial and visual sensing is well known from biological systems, e.g. the vestibular apparatus and its interconnections to the eyes in vertebrates. In order to make optimal use of inertial and visual signals, simultaneous differential and integral representations on different scales, both in space and in time, are being exploited; Table 2.1 shows the four categories introduced. The upper left corner represents the point 'here and now' in space and time where all interactions of a sensor or an actuator with the real world take place. Inertial sensors yield information on local accelerations (arrow 1 from field (1,1) to field (3,3) in the table) and turn rates of this point. Within a rigid structure of an object the turn rates are the same all over the body; therefore, the inertially measured rate signals (arrow 2 from field (1,3) to (3,3)) are drawn on the spatial object level (row 3). The local surface of a structure may be described by the change of its tangent direction along some arc length; this is called curvature and is an element of local shape. It is a geometrical characterization of this part of the object in differential form; row 2 in Table 2.1 represents these local spatial differentials, which may cause specific edge features (straight or curved ones) in the image under certain aspect conditions. Single objects may be considered to be local spatial integrals (represented in row 3 of Table 2.1), the shapes of which are determined by their spatial curvature distributions on the surface; in connection with the aspect conditions and the photometric properties of the surface, these determine the feature distribution in the image.
Table 2.1. Differential and integral representations on different scales for dynamic perception: the columns range over time (point in time; temporally local differentials; local time integrals over the basic cycle time; extended local time integrals; global time integrals), the rows over space (the point 'here and now'; spatially local differentials such as edge angles, positions and curvatures; local space integrals, i.e. objects with state and shape; the maneuver space of objects; the mission space of objects), with dynamical models, state transitions and predictions of situations linking the fields.
Since, in general, several objects may be viewed simultaneously, the arrangements of objects of relevance in a task context, called 'geometrical elements of a situation', are also perceived and taken into account for behavior decision and reactive control. For this reason, the visual data input, labeled by the index 3 at the corresponding arrows into the central interpretation process, field (3,3), has three components: 3a) the so-called detection component, for measured features not yet associated with an object; 3b) the object-oriented tracking component, with a strong predictive element for improving efficiency; and 3c) the perception component for the environment, which preshapes the maneuver space for the self and all the other objects. Seen this way, vision simultaneously provides geometrical information on both differential (row 2) and integral scales (row 3 for single objects, row 4 for local maneuvering, and row 5 for mission performance). Temporal change is represented in column 2, which yields the corresponding time derivatives of the elements in the column to the left. Because of noise
amplification associated with numerical differentiation of high-frequency signals (d/dt(A sin(ωt)) = Aω cos(ωt)), this operation is usable only for smooth signals, as for computing speed from odometry; in particular, optical flow computation at image points is deliberately avoided. Even on the feature level, the operation of integration with its smoothing effect, as used in recursive estimation, is preferred. In the matrix field (3,2) of Table 2.1 the key knowledge elements and the corresponding tools for sampled data processing are indicated: due to mass and limited energy availability, motion processes in the real world are constrained; good models for unperturbed motion of objects belonging to specific classes are available in the natural and engineering sciences, representing the dependence of the temporal rate of change of the state variables on both the state and the control variables. These are the so-called 'dynamical models'. For constant control inputs over the integration period, these models can be integrated to yield difference equations which link the states of objects in column 3 of Table 2.1 to those in column 1, thereby bridging the gap of column 2; in control engineering, methods and libraries with computer codes are available to handle all problems arising. Once the states at one point in time are known, the corresponding time derivatives are delivered by these models. Recursive estimation techniques developed since the 1960s exploit this knowledge by making state predictions over one cycle, disregarding perturbations; then the measurement models are applied, yielding predicted measurements. In the 4-D approach, these are communicated to the image processing stage in order to improve image evaluation efficiency (arrow 4 from field (3,3) to (1,3) in Table 2.1 on the object level, and arrow 5 from (3,3) to (2,3) on the feature extraction level). A comparison with the actually measured features then yields the prediction errors used for the state update. In order to better understand what is going to happen on a larger scale, these predictions may be repeated several (many) times in a very fast in-advance simulation assuming likely control inputs; for stereotypical maneuvers like lane changes in road vehicle guidance, a finite sequence of 'feed-forward' control inputs is known to have a longer term state transition effect. These are represented in field (4,4) of Table 2.1 and by arrow 6; section 6 below will deal with these problems. For the compensation of perturbation effects, direct state feedback, well known from control engineering, is used. With linear systems theory, eigenvalues and damping characteristics for state transitions of the closed-loop system can be specified (field (3,4) and row 4 in Table 2.1). This is knowledge also linking differential representations to integral ones; low-frequency and high-frequency components may be handled separately in the time or in the frequency domain (Laplace transform), as usual in aerospace engineering. This is left open and indicated by the empty row and column in Table 2.1. The various feed-forward and feedback control laws which may be used in superimposed modes constitute behavioral capabilities of the autonomous
vehicle. If a sufficiently rich set of these modes is available, and if the system is able to recognize which behavioral capabilities to activate, and with which parameters, in a given situation in order to achieve mission goals, then the capability for autonomous performance of entire missions is given. This is represented by field (n,n) (lower right corner) and will be discussed in sections 6 to 8. Essentially, mission performance requires proper sequencing of behavioral capabilities in the task context; with corresponding symbolic representations on the higher, more abstract system levels, an elegant symbiosis of control engineering and AI methods can thus be realized.
3. Task domains
Though the approach is very general and has also been adapted to other task domains such as aircraft landing approaches and helicopter landmark navigation, only road vehicle guidance will be discussed here. The most well-structured environments for autonomous vehicles are freeways with limited access (high speed vehicles only) and strict regulations for construction parameters like lane widths, maximum curvatures and slopes, on- and off-ramps, and no same-level crossings. For this reason, even though high speeds are driven there, freeway driving was selected as the first task domain for autonomous vehicle guidance by our group in 1985. On normal state roads the variability of road parameters and of traffic participants is much larger; in particular, same-level crossings and oncoming traffic increase the relative speed between objects, thereby increasing the hazard potential even though traveling speed may be limited to a much lower level. Bicyclists and pedestrians as well as many kinds of animals are normal traffic participants. In addition, lanes may be narrower on average, and the surface state may well be poor on lower order roads (e.g. potholes), especially in the transition zone to the shoulders. In urban traffic, things may be even worse with respect to crowdedness and the crossing of subjects. These latter environments are considered not yet amenable to autonomous driving because of the scene complexity and the computing performance required.
4. The sensory systems
The extremely high data rates of image sequences are both an advantage (with respect to versatility in acquiring new information on both the environment and other objects/subjects) and a disadvantage (with respect to the computing power needed and the delay time incurred until the information has been extracted from the data). For this reason it makes sense to rely on conventional sensors in addition, since they deliver information on specific output variables with minimal time delay.
Fig. 4.1. Binocular camera arrangement of VaMP
4.1 Conventional sensors
For ground vehicles, odometers and speedometers, as well as sensors for the positions and angles of subparts like actuators and pointing devices, are commonplace. For aircraft, inertial sensors like accelerometers, angular rate sensors, and vertical as well as directional gyros are standard. Evaluating this information in conjunction with vision alleviates image sequence processing considerably. Based on the experience gained in air vehicle applications, inexpensive inertial sensors like accelerometers and angular rate sensors have been adopted for road vehicles too, because of their beneficial and complementary effects relative to vision. Part of this has already been discussed in section 2 and will be detailed below.
4.2 Vision sensors
Because of the large viewing ranges required, a single camera as vision sensor is by no means sufficient for practical purposes. In the past, bifocal camera arrangements (see fig. 4.1) with a wide-angle camera (about 45° aperture) and a tele camera (about 15° aperture) mounted fixed relative to each other on a two-axis platform for viewing direction control have been used [Dic 95a]; in future systems, trinocular camera arrangements with a wide simultaneous field of view (> 100°, from two divergently mounted wide-angle cameras) and a 3-chip color CCD camera will be used [Dic 95b]. For high-speed driving on German Autobahnen, a fourth camera with a relatively strong tele-lens will even be added, allowing lane recognition at several hundred meters distance. All these data are evaluated 25 times per second, the standard European video rate.
Fig. 5.1. Multiple feedback loops on different space scales for efficient scene interpretation and behavior control: control of image acquisition and processing (lower left corner), 3-D 'imagination' space in the upper half; motion control (lower right corner).
5. Spatio-temporal perception: The 4-D approach
Since the late 1970s, observer techniques as developed in systems dynamics [Lue 64] have been used at UBM in the field of motion control by computer vision [MeD 83]. In the early 1980s, H. J. Wuensche did a thorough comparison between observer and Kalman filter realizations of recursive estimation applied to vision for the original task of balancing an inverted pendulum on an electro-cart by computer vision [Wue 83]. Since then, refined versions of the Extended Kalman Filter (EKF) with numerical stabilization (UDU^T factorization, square root formulation) and sequential updates after each new measurement have been applied as standard methods to all dynamic vision problems at UBM. Based on experience gained from 'satellite docking' [Wue 86], road vehicle guidance, and on-board autonomous aircraft landing approaches by machine vision, it was realized in the mid-1980s that the joint use of dynamical models and temporal predictions for several aspects of the overall problem in parallel was the key to achieving a quantum jump in the performance level of autonomous systems based on machine vision. Besides state estimation for the physical objects observed and control computation based on these estimated
states, it was the feedback of the knowledge thus gained to the image feature extraction and to the feature aggregation level which allowed an increase in efficiency of image sequence evaluation of one to two orders of magnitude (see fig. 5.1 for a graphical overview). Following state prediction, the shape and the measurement models were exploited for determining:
- viewing direction control by pointing the two-axis platform carrying the cameras;
- locations in the image where information for most easy, non-ambiguous and accurate state estimation could be found (feature selection);
- the orientation of edge features, which allowed the number of search masks and directions to be reduced for robust yet efficient and precise edge localization;
- the length of the search path as a function of the actual measurement uncertainty;
- strategies for efficient feature aggregation guided by the idea of the 'Gestalt' of objects; and
- the Jacobian matrices of first order derivatives of feature positions relative to state components in the dynamical models, which contain rich information for interpretation of the motion process in a least squares error sense, given the motion constraints, the features measured, and the statistical properties known.
This integral use of 1. dynamical models for motion of and around the center of gravity, taking actual control outputs and time delays into account, 2. spatial (3-D) shape models for specifying visually measurable features, 3. perspective mapping models, and 4. prediction error feedback for estimation of the object state in 3-D space and time, simultaneously and in closed-loop form, was termed the '4-D approach'. It is far more than a recursive estimation algorithm based on some arbitrary model assumption in some arbitrary subspace or in the image plane. It is estimated from a scan of recent publications in the field that even today most of the papers referring to 'Kalman filters' do not take advantage of this integrated use of spatio-temporal models based on physical processes. Initially, in our applications just the ego-vehicle was assumed to be moving on a smooth surface or trajectory, with the cameras fixed to the vehicle body. In the meantime, solutions to rather general scenarios are available, with several cameras spatially arranged on a platform which may be pointed by voluntary control relative to the vehicle body. These camera arrangements allow a wide simultaneous field of view, a central area for trinocular (skew) stereo interpretation, and a small area with high image resolution for 'tele' vision.
Fig. 5.2. Survey on the 4-D approach to dynamic machine vision with three major areas of activity: object detection (central arrow upwards), tracking and state estimation (recursive loop in lower right), and learning (loop in center top), the latter two being driven by prediction error feedback.
The vehicle may move in full 6 degrees of freedom; while moving, several other objects may move independently in front of a stationary background. One of these objects may be 'fixated' (tracked) by the pointing device, using inertial and visual feedback signals for keeping the object (almost) centered in the high resolution image. A newly appearing object in the wide field of view may trigger a fast viewing direction change such that this object can be analysed in more detail by one of the tele-cameras; this corresponds to 'saccadic' vision as known from vertebrates and allows very much reduced data rates for a complex sense of vision. It essentially trades the need for time-sliced attention control and sampled-data based scene reconstruction against a data rate reduction of 1 to 2 orders of magnitude as compared to full resolution in the entire simultaneous field of view. The 4-D approach lends itself to this type of vision since both object-orientation and the temporal ('dynamical') models are available in the system already. This complex system design for dynamic vision has been termed EMS-vision (from Expectation-based, Multi-focal and Saccadic); it is currently being implemented with an experimental set of four miniature TV-cameras on a two-axis pointing platform dubbed 'Multi-focal active / reactive Vehicle Eye' (MarVEye) [Dic 95b]. In the rest of the paper, major developmental steps in the 4-D approach over the last decade and the results achieved will be reviewed.
5.1 Structural survey on the 4-D approach
Figure 5.2 shows the three main activities running in parallel in an advanced version of the 4-D approach:
1. Detection of objects from typical collections of features not yet assigned to some object already tracked (center left, upward arrow); when these feature collections are stable over several frames, an object hypothesis has to be formed and the new object is added to the list of those regularly tracked (arrow to the right, object n).
2. Tracking of objects and state estimation, shown in the loop to the lower right in figure 5.2; first, with the control output chosen, a single-step prediction is done in 3-D space and time, the 'imagined real world'. This step consists of two components: a) the 'where'-signal path, concentrating on progress of motion in both translational and rotational degrees of freedom, and b) the 'what'-signal path, dealing with object shape. (In order not to overburden the figure these components are not shown.)
3. Learning from observation is done with the same data as for tracking; however, this is not a single-step loop but rather a low-frequency estimation component concentrating on 'constant' parameters, or it is even an off-line component with batch processing of stored data. This is an active construction site in code development at present, which will open up the architecture towards becoming more autonomous in new task domains as the experience of the system grows. Both dynamical models (for the 'where'-part) and shape models (for the 'what'-part) shall be learnable.
Another component under development, not detailed in figure 5.2, is situation assessment and behavior decision; this will be discussed in section 6.
5.2 Generic 4-D object classes
The efficiency of the 4-D approach to dynamic vision is achieved by associating background knowledge about classes of objects and their behavioral capabilities with the data input. This knowledge is available in generic form, that is, structural information typical for object classes is fixed while specific parameters in the models have to be adapted to the special case at hand. Motion descriptions for the center of gravity (the translational object trajectory in space) and for rotational movements, which together form the so-called 'where'-problem, are separated from shape descriptions, called the 'what'-problem. Typically, summing and averaging of feature positions is needed to solve the where-problem, while differencing feature positions contributes to solving the what-problem.
5.2.1 Motion description. Possibilities for object trajectories are so abundant that they cannot be represented with reasonable effort. However, good
models are usually available describing their evolution over time as a function of the actual state, the control inputs and the perturbation inputs. These so-called 'dynamical models' are usually sets of nonlinear differential equations, ẋ = f(x, u, v', t), with x the n-component state vector, u the r-component control vector and v' a perturbation input.
Fig. 5.3. Coarse-to-fine shape model of a car in rear view: a) encasing rectangle (U-shape); b) polygonal silhouette; c) silhouette with internal structure.
Through linearization around a nominal trajectory x_N(t), locally linearized descriptions are obtained which can be integrated analytically to yield the (approximate) local transition matrix description for small cycle times T:

x[(k+1)T] = A x[kT] + B u[kT] + v[kT]     (5.1)
The elements of the matrices A and B are obtained from F(t) = ∂f/∂x|_N and G(t) = ∂f/∂u|_N by standard methods from systems theory. Usually, the states cannot be measured directly but through the output variables y given by
y[kT] = h(x[kT], p, kT) + w[kT]     (5.2)
where h may be a nonlinear mapping (see below), p are mapping parameters and w represents measurement noise. On the basis of eq. (5.1) a distinction between 'objects' proper and 'subjects' can be made: if there is no dependence on controls u in the model, or if this u(t) is input by another agent, we speak of an 'object' (controlled by a subject in the latter case). If u[kT] may be activated by some internal activity within the object, be it by pre-programmed outputs or by results obtained from processing of measurement data, we speak of a 'subject'.
5.2.2 Shape and feature description. With respect to shape, objects and subjects are treated in the same fashion. Only rigid objects and objects consisting of several rigid parts linked by joints have been treated; for elastic and plastic modeling see [DeM 96]. Since objects may be seen at different ranges, the appearance in the image may vary considerably in size. At large ranges the 3-D shape of the object is usually of no importance to the observer, and the cross-section seen contains most of the information for tracking. However, this cross-section depends on the angular aspect conditions; therefore, both
coarse-to-fine and aspect-dependent modeling of shape is necessary for efficient dynamic vision. This will be discussed briefly for the task of perceiving road vehicles as they appear in normal road traffic. Coarse-to-fine shape models in 2-D: Seen from behind or from the front at a large distance, any road vehicle may be adequately described by its encasing rectangle; this is convenient since this shape has just two parameters, width b and height h. Absolute values of these parameters are of no importance at larger distances; the proper scale may be inferred from other known objects seen, like road or lane width at that distance. Trucks (or buses) and cars can easily be distinguished. Our experience is that even the upper limit and thus the height of the object may be omitted without loss of functionality (reflections in this spatially curved region of the car body together with varying environmental conditions may make reliable tracking of the upper body boundary very difficult); thus, a simple U-shape of unit height (a height corresponding to about 1 m turned out to be practical) seems to be sufficient until 1 to 2 dozen pixels can be found on a line crossing the object in the image. Depending on the focal length used, this corresponds to different absolute distances. Fig. 5.3a shows this shape model. If the object in the image is large enough so that details may be distinguished reliably by feature extraction, a polygonal shape approximation as shown in fig. 5.3b, or even one with internal details (fig. 5.3c), may be chosen; in the latter case, area-based features like the license plate, the tires or the signal light groups (usually in yellow or reddish color) may allow more robust recognition and tracking. If the view is from an oblique direction, the depth dimension (length of the vehicle) comes into play. Even with viewing conditions slightly off the axis of symmetry of the vehicle observed, the width of the car in the image will start increasing rapidly because of the larger length of the body and due to the sine-effect in mapping. Usually, it is impossible to determine the lateral aspect angle, body width and length simultaneously from visual measurements; therefore, switching to the body diagonal as a shape representation has proven to be much more robust and reliable in real-world scenes [Scd 94].
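The model pair of eqs. (5.1) and (5.2) — a linearized transition step plus a nonlinear measurement mapping — is what the recursive estimation of the next section iterates on. The following is a minimal numerical sketch with a generic, user-supplied h and illustrative matrices; it is not the UBM implementation, and the finite-difference Jacobian merely stands in for the analytically derived Jacobians used there.

```python
import numpy as np

def predict_state(A, B, x_k, u_k):
    """Eq. (5.1): one transition-matrix step x[(k+1)T] = A x[kT] + B u[kT]
    (the unknown perturbation v[kT] is omitted in the prediction)."""
    return A @ x_k + B @ u_k

def predict_measurement(h, x_pred, p):
    """Eq. (5.2) without noise: expected feature positions y = h(x, p)."""
    return h(x_pred, p)

def numerical_jacobian(h, x, p, delta=1e-6):
    """Finite-difference approximation of C_x = dh/dx, used to relate
    prediction errors in the image to state corrections."""
    y0 = np.atleast_1d(h(x, p))
    C = np.zeros((y0.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = delta
        C[:, i] = (np.atleast_1d(h(x + dx, p)) - y0) / delta
    return C
```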
5.3 State estimation
The basic approach has been described many times (see [Wue 86; Dic 87; Dic 92; Beh 96; Tho 96]) and has remained essentially the same for visual relative state estimation for years now. However, in order to better deal with the general case of scene recognition under (more strongly) perturbed ego-motion, an inertially based component has been added [Wea 96, Wer 97]. This type of state estimation is not new at all compared to inertial navigation, e.g. for missiles; here, however, only very inexpensive accelerometers and angular rate sensors are used. This is acceptable only because the resulting drift problems are handled by a visual state estimation loop running in parallel, thereby resembling the combined use of (relatively poor) inertial
signals from the vestibular apparatus and of visual signals in vertebrate perception. Some of these inertial signals may also be used for stabilizing the viewing direction with respect to the stationary environment by direct negative feedback of angular rates to the pointing device carrying the cameras. This feedback actually runs at very high rates in our systems (500 Hz, see
[Scn 95]).
5.3.1 Inertially based ego-state estimation (IbSE). The advantage of this new component is three-fold: 1. Because of the direct encoding of accelerations along, and rotational speed components around, body-fixed axes, time delays are negligible. These components can be integrated numerically to yield predictions of positions. 2. The quantities measured correspond to the forces and moments actually exerted on the vehicle, including the effects of perturbations; therefore, they are more valuable than predictions from a theoretical model disregarding perturbations, which are unknown in general. 3. If good models for the eigen-behavior are available, the inertial measurements allow parameters in perturbation models to be estimated, thereby leading to a deeper understanding of environmental effects.
5.3.2 Dynamic vision. With respect to ego-state recognition, vision now has reduced but still essential functionality. It has to stabilize long-term interpretation relative to the stationary environment, and it has to yield information on the environment, like position and orientation relative to the road and road curvature in vehicle guidance, which are not measurable inertially. With respect to other vehicles or obstacles, the vision task is also slightly alleviated since the high-frequency viewing direction component is now known; this reduces the search range required for feature extraction and leads to higher efficiency of the overall system. These effects can only be achieved using spatio-temporal models and perspective mapping, since these items link inertial measurements to features in the image plane. With different measurement models for all the cameras used, a single object model and its recursive iteration loop may be fed with image data from all relevant cameras. Jacobian matrices now exist for each object/sensor pair. The nonlinear measurement equation (2) is linearized around the predicted nominal state x_N and the nominal parameter set p_N, yielding (without the noise term)
y[kT] = y_N[kT] + δy[kT] = h(x_N[kT], p_N, kT) + C_x δx + C_p δp,    (5.3)
where C_x = ∂h/∂x|_N and C_p = ∂h/∂p|_N are the Jacobian matrices with respect to the state components and the parameters involved. Since the first terms to the right of the equality sign are equal by definition, eq. (5.3) may be used to determine δx and δp in a least-squares sense from δy, the prediction error measured (observability given); this is the core of recursive estimation.
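A minimal sketch of this correction step (state correction only, with the gain obtained from a plain least-squares solution rather than the full recursive filter actually used; numpy is assumed):

import numpy as np

def correct_state(x_pred, y_measured, y_predicted, C_x):
    # delta_y is the measured prediction error; solve C_x * delta_x ~= delta_y
    # in a least-squares sense and apply the correction to the predicted state.
    delta_y = y_measured - y_predicted
    delta_x, _, _, _ = np.linalg.lstsq(C_x, delta_y, rcond=None)
    return x_pred + delta_x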
5.4 Situation assessment
For each object an estimation loop is set up, yielding best estimates for the state relative to the ego-vehicle, including all spatial velocity components. For stationary landmarks, the velocity is of course the negative of the ego-speed. Since this is known reliably from conventional measurements, the distance to the landmark can be determined even with monocular vision by exploiting motion stereo [Hoc 94, Tho 96, Mül 96]. With all this information available for the surrounding environment and the most essential objects in it, an interpretation process can evaluate the situation in a task context and come up with a conclusion whether to proceed with the behavioral mode running or to switch to a different mode. Fast in-advance simulations exploiting dynamical models and alternative stereotypical control inputs yield possible alternatives for the near-term evolution of the situation. These decisions are made by comparing the options or by resorting to precomputed and stored results.
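The motion-stereo idea for a stationary landmark can be sketched as follows; the known ego-motion between two frames is treated like a stereo baseline, and the names and the pinhole simplification are assumptions for illustration only:

def motion_stereo_range(focal_px, ego_motion_m, u_first, u_second):
    # Range to a stationary landmark from its image displacement between two
    # frames separated by a known ego-motion (simple pinhole approximation).
    disparity = abs(u_second - u_first)
    if disparity == 0.0:
        return float("inf")   # no measurable parallax yet
    return focal_px * ego_motion_m / disparity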
6. Generation of behavioral capabilities
Dynamic vision is geared to closed-loop behavior in a task context; the types of behavior of relevance, of course, depend on the special task domain. The general aspect is that behaviors are generated by control output. There are two basically different types of control generation: 1. triggering the activation of (generically) stored time histories, so-called feed-forward control, by events actually observed, and 2. gearing actual control to the difference between the desired and the actual state of the relevant systems, so-called feedback control. In both cases, the actual control parameters may depend on the given situation. A very general method is to combine the two given above (as a third case in the list), which is especially easy in the 4-D approach, where dynamical models are already available for the motion-understanding part. The general feed-forward control law in generic form is
u(τ) = g(p_M, T_M),   with 0 ≤ τ = t − t_Trig ≤ T_M,    (6.1)
where p_M may contain averaged state components (like speed). A typical feed-forward control element is the steer control output for a lane change: in a generic formulation, for example, the steer rate is set in five phases during the maneuver time T_M; the first and the final phase, of duration T_p each, consist of a constant steer rate, say R. In the second and
fourth phase, of the same duration, the amplitude is of opposite sign to the first and last one. In the third phase the steer rate is zero; this phase may be missing altogether (duration zero). The parameters R, T_M, T_p have to be selected such that at (T_M + ΔT_D) the lateral offset is just one lane width, with the vehicle heading the same as before; these parameters, of course, depend on the speed driven. Given this idealized control law, the corresponding state component time histories x_c(τ) for 0 ≤ τ = t − t_Trig ≤ (T_M + ΔT_D) can be computed according to a good dynamical model; the additional time period ΔT_D at the end is added because in real dynamical maneuvers the transition is not completed at the time when the control input ends. In order to counteract disturbances during the maneuver, the difference Δx(τ) = x_c(τ) − x(τ) may be used in a superimposed state feedback controller to force the real trajectory towards the ideal one. The general state feedback control law is
u(τ) = −K Δx(τ),    (6.2)
with K an r by n gain matrix. The gain coefficients may be set by pole placement or by a Riccati design (optimal linear quadratic controller), well known in control engineering [Kai 80]. Both methods include knowledge about behavioral characteristics along the time axis: while pole placement specifies the eigenvalues of the closed-loop system, the Riccati design minimizes weighted integrals of state errors and control inputs. The simultaneous use of dynamical models for both perception and control and for the evaluation process leading to behavior decisions is what makes this approach so efficient. Figure 5 shows the closed-loop interactions in the overall system. Based on object state estimation (lower left corner), events are detected (center left) and the overall situation is assessed (upper left). Initially, the upper level has to decide which of the available behavioral capabilities are to be used: feed-forward, feedback, or a superposition of both; later on, the activated feedback loops run continuously (lower part in fig. 5 with horizontal texture) without intervention from the upper levels, except for mode changes. Certain events may also trigger feed-forward control outputs directly (center right). Since the actual trajectory evolving from this control input may differ from the expected nominal one due to unforeseeable perturbations, commanded state time histories x_c(t) are generated in the block 'state prediction' (center of fig. 5, upper right backward corner) and used as reference values for the feedback loop (arrow from top at lower center). In this way, combining feed-forward direct control and actual error feedback, the system will realize the commanded behavior as closely as possible and deal with perturbations without the need for replanning on the higher levels. All that is needed for mission performance of any specific system, then, is a sufficiently rich set of feed-forward and feedback behavioral capabilities.
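The following sketch illustrates the combination described above for the lane-change example: a five-phase feed-forward steer-rate profile plus the superimposed feedback of eq. (6.2). The parameters R, T_M and T_p are assumed to have been chosen for the current speed, and K is a gain row vector for the single steer-rate channel; none of this is the original implementation.

import numpy as np

def lane_change_steer_rate(tau, R, T_M, T_p):
    # Generic five-phase feed-forward steer-rate time history:
    # +R, -R, 0, -R, +R over the maneuver time T_M (phases 1, 2, 4, 5 last T_p each).
    if tau < 0.0 or tau > T_M:
        return 0.0
    if tau < T_p:
        return R          # phase 1
    if tau < 2.0 * T_p:
        return -R         # phase 2 (opposite sign)
    if tau > T_M - T_p:
        return R          # phase 5
    if tau > T_M - 2.0 * T_p:
        return -R         # phase 4 (opposite sign)
    return 0.0            # phase 3: steer rate zero

def steer_command(tau, x_actual, x_commanded, R, T_M, T_p, K):
    # Feed-forward term plus superimposed state feedback, eq. (6.2).
    u_ff = lane_change_steer_rate(tau, R, T_M, T_p)
    u_fb = -float(np.dot(K, x_actual - x_commanded))
    return u_ff + u_fb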
Fig. 6.1. Knowledge-based real-time control system with three hierarchical levels and time-horizons.
These capabilities have to be activated in the right sequence such that the goals are achieved in the end. For this purpose, each behavioral capability has to be represented on the upper decision level by a global description of its effects: 1. For feed-forward behaviors with corrective feedback superimposed (case 3 given above) it is sufficient to just represent initial and final conditions, including the time needed; note that this is a quasi-static description as used in AI methods. This level does not have to worry about real-time dynamics, which are taken care of by the lower levels. It just has to know in which situations these behavioral capabilities may be activated, and with which parameter set. 2. For feedback behaviors it is sufficient to know when this mode may be used; these reflex-like fast reactions may run over unlimited periods of time if not interrupted by some special event. A typical example is lane following in road vehicle guidance; the integral of speed then is the distance traveled, irrespective of the curvatures of the road. These values are given in information systems for planning, like maps or tables, and can be used for checking mission progress on the upper level. Performing more complex missions on this basis has just begun.
Fig. 7.1. The autonomous vehicle VaMP of UBM
7. Experimental results
The autonomous road vehicle VaMP (see Fig. 7.1) and its twin VITA II of Daimler-Benz have shown remarkable performance in normal freeway traffic in France, Germany and Denmark since 1994. VaMP has two pairs of bifocal camera sets with focal lengths of 7.5 and 24 mm; one looks to the front, the other to the rear. With 320 by 240 pixels per image this is sufficient for observing road and traffic up to about 100 m in front of and behind the vehicle. With its 46 transputers for image processing it was able in 1994 to recognize road curvature, lane width, number of lanes, type of lane markings, its own position and attitude relative to the lane and to the driveway, and the relative state of up to ten other vehicles including their velocity components, five in each hemisphere. At the final demonstration of the EUREKA project Prometheus near Paris, VaMP demonstrated its capabilities of free-lane driving and convoy driving at speeds up to 130 km/h in normally dense three-lane traffic; lane changing for passing, and even the decision whether lane changes were safely possible, were done autonomously. The human safety pilot just had to check the validity of the decision and to give a go-ahead input [Dic 95a]. In the meantime, the transputers have been replaced by PowerPC MPC 601 processors with an order of magnitude more computing power. A long-range trip over about 1600 km to a project meeting in Odense, Denmark in 1995 was performed in which about 95% of the distance could be traveled fully automatically, in both the longitudinal and lateral degrees of freedom. Maximum speed on a free stretch in the northern German plain was 180 km/h, with the human safety driver in charge of long-distance obstacle detection.
Since only black-and-white video signals have been evaluated with edge feature extraction algorithms, construction sites with yellow markings on top of the white ones could not be handled; also, passing vehicles cutting into the own lane very nearby posed problems, because they could not be picked up early enough due to the lack of a wide simultaneous field of view, and because monocular range estimation took too long to converge to a stable interpretation without seeing the contact point of the vehicle on the ground. For these reasons, the system is now being improved with a wide field of view from two divergently oriented wide-angle cameras with a central region of overlap for stereo interpretation; additionally, a high-resolution (3-chip) color camera also covers the central part of the stereo field of view. This allows for trinocular stereo and area-based object recognition. Dual PentiumPro processors now provide the processing power for tens of thousands of mask evaluations with CRONOS per video cycle and processor. VaMoRs, the 5-ton van in operation since 1985, which has demonstrated quite a few 'firsts' in autonomous road driving, has seen the sequence of microprocessors from Intel 8086 and 80x86, via transputers and PowerPCs, back to general-purpose Intel Pentium and PentiumPro. In addition to early high-speed driving on freeways [DiZ 87], it has demonstrated its capability of driving on state roads and on minor unsealed roads at speeds up to 50 km/h (1992); it is able to recognize hilly terrain and to estimate vertical road curvature in addition to the horizontal one. Recognizing cross-roads of unknown width and angular orientation has been demonstrated, as well as turning off onto these roads, even with tight curves requiring an initial maneuver in the direction opposite to the curve [Mül 96; DiM 95]. These capabilities will also be considerably improved by the new camera arrangement with a wide simultaneous field of view and area-based color image processing. Performing entire missions based on digital maps has been started [Hoc 94] and is now eased by a GPS receiver in combination with the recently introduced inertial state estimation [Mül 96; Wer 97]. The vehicles VaMoRs and VaMP together have accumulated a record of about 10 000 km of fully autonomous driving on many types of roadways.
8. Conclusions
The 4-D approach to dynamic machine vision, developed along the lines laid out by cybernetics and conventional engineering a long time ago, does seem to satisfy all the expectations it shares with 'Artificial Intelligence' and 'Neural Net' approaches. Complex perception and control processes like ground vehicle guidance under diverse conditions and in rather complex scenes have been demonstrated, as well as maneuver and mission control in full six degrees of freedom. The representational tools of computer graphics and simulation
have been complemented for dealing with the inverse problem of computer vision. Computing power is now arriving for handling real-world problems in real time. The lack of robustness encountered up to now with black-and-white, edge-based image understanding can now be addressed by complementing it with area-based representations including color and texture, both very demanding with respect to processing power. Taking advantage of well-suited methods in competing approaches and combining the best of every field in a unified overall approach will be the most promising way to go. Expectation-based, multi-focal, saccadic (EMS) vision contains some of the most essential achievements of vertebrate eyes in the biological realm, realized, however, in a quite different way.
References
[Beh 96] R. Behringer: Visuelle Erkennung und Interpretation des Fahrspurverlaufes durch Rechnersehen für ein autonomes Straßenfahrzeug. PhD thesis, UniBwM, LRT, 1996.
[DDi 97] Dirk Dickmanns: Rahmensystem für visuelle Wahrnehmung veränderlicher Szenen durch Computer. PhD thesis, UniBwM, INF, 1997.
[DeM 96] D. DeCarlo and D. Metaxas: The Integration of Optical Flow and Deformable Models with Applications to Human Face Shape and Motion Estimation. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, June 1996, pp 231-238.
[Dic 87] E.D. Dickmanns: 4-D Dynamic Scene Analysis with Integral Spatio-Temporal Models. 4th Int. Symposium on Robotics Research, Santa Cruz, 1987.
[Dic 89] E.D. Dickmanns: Subject-Object Discrimination in 4-D Dynamic Scene Interpretation for Machine Vision. Proc. IEEE Workshop on Visual Motion, Newport Beach, 1989, pp 298-304.
[Dic 92] E.D. Dickmanns: Machine Perception Exploiting High-Level Spatio-Temporal Models. AGARD Lecture Series 185 'Machine Perception', Hampton, VA, Munich, Madrid, Sept./Oct. 1992.
[Dic 95a] E.D. Dickmanns: Performance Improvements for Autonomous Road Vehicles. Int. Conference on Intelligent Autonomous Systems (IAS-4), Karlsruhe, 1995.
[Dic 95b] E.D. Dickmanns: Road vehicle eyes for high precision navigation. In Linkwitz et al. (eds): High Precision Navigation. Dümmler Verlag, Bonn, 1995, pp 329-336.
[DiG 88] E.D. Dickmanns, V. Graefe: a) Dynamic monocular machine vision. Machine Vision and Applications, Springer International, Vol. 1, 1988, pp 223-240. b) Applications of dynamic monocular machine vision. (ibid), 1988, pp 241-261.
[DiM 95] E.D. Dickmanns and N. Müller: Scene Recognition and Navigation Capabilities for Lane Changes and Turns in Vision-Based Vehicle Guidance. Control Engineering Practice, 2nd IFAC Conf. on Intelligent Autonomous Vehicles-95, Helsinki, 1995.
[DiZ 87] E.D. Dickmanns and A. Zapp: Autonomous High Speed Road Vehicle Guidance by Computer Vision. 10th IFAC World Congress, Munich, Preprint Vol. 4, 1987, pp 232-237.
[Hoc 94] C. Hock: Wissensbasierte Fahrzeugführung mit Landmarken für autonome Roboter. PhD thesis, UniBwM, LRT, 1994.
[Kai 80] T. Kailath: Linear Systems. Prentice-Hall, Inc., Englewood Cliffs, N.J., 1980.
[Lue 64] D.G. Luenberger: Observing the state of a linear system. IEEE Trans. on Mil. Electronics 8, 1964, pp 290-293.
[MeD 83] H.G. Meissner and E.D. Dickmanns: Control of an Unstable Plant by Computer Vision. In T.S. Huang (ed): Image Sequence Processing and Dynamic Scene Analysis. Springer-Verlag, Berlin, 1983, pp 532-548.
[Mül 96] N. Müller: Autonomes Manövrieren und Navigieren mit einem sehenden Straßenfahrzeug. PhD thesis, UniBwM, LRT, 1996.
[Scn 95] J. Schiehlen: Kameraplattformen für aktiv sehende Fahrzeuge. PhD thesis, UniBwM, LRT, 1995.
[Tho 96] F. Thomanek: Visuelle Erkennung und Zustandsschätzung von mehreren Straßenfahrzeugen zur autonomen Fahrzeugführung. PhD thesis, UniBwM, LRT, 1996.
[Wea 96] S. Werner, S. Fürst, D. Dickmanns, and E.D. Dickmanns: A vision-based multi-sensor machine perception system for autonomous aircraft landing approach. Enhanced and Synthetic Vision, AeroSense '96, Orlando, FL, April 1996.
[Wer 97] S. Werner: Maschinelle Wahrnehmung für den bordautonomen automatischen Hubschrauberflug. PhD thesis, UniBwM, LRT, 1997.
[Wue 83] H.-J. Wuensche: Verbesserte Regelung eines dynamischen Systems durch Auswertung redundanter Sichtinformation unter Berücksichtigung der Einflüsse verschiedener Zustandsschätzer und Abtastzeiten. Report HSBw/LRT/WE 13a/IB/832, 1983.
[Wue 86] H.-J. Wuensche: Detection and Control of Mobile Robot Motion by Real-Time Computer Vision. In N. Marquino (ed): Advances in Intelligent Robotics Systems. Proceedings of the SPIE, Vol. 727, 1986, pp 100-109.
Real-Time Pose Estimation and Control for Convoying Applications
R. L. Carceroni, C. Harman, C. K. Eveland, and C. M. Brown
Department of Computer Science, University of Rochester, Rochester, NY 14627 USA
1. Introduction
One of the main obstacles to the practical feasibility of many computer vision techniques has been the necessity of using expensive specialized hardware for low-level image processing in order to achieve real-time performance. However, gradual improvements in the architectural design and in the manufacturing technology of general-purpose microprocessors have made their use for low-level vision more and more attractive. In this chapter, we demonstrate the real-time feasibility of a tracking system for smart vehicle convoying that uses basically a dual 133 MHz Pentium board, an M68332 micro-controller and a commercial PCI frame grabber. The task at hand consists of enabling an autonomous mobile robot with a single off-the-shelf camera to follow a target placed on the posterior part of another, manually controlled mobile platform. The key ideas used to achieve efficiency are quite traditional concepts in computer vision. On the low-level image processing front, we use multiresolution techniques to allow the system to quickly locate the regions of interest in each scene and then to focus its attention exclusively on them, in order to obtain accurate geometrical information at relatively low computational cost. In the higher-level processes of geometrical analysis and tracking, the key idea is to use as much a priori information about the target as possible, in order to develop routines that combine maximum efficiency with high precision, provided that their specialized geometry and dynamics assumptions are met. Finally, for the control itself, we use a two-level strategy that combines error signals obtained from the analysis of visual data and from odometry. So, while we do introduce some novel formulations, especially in the context of geometrical analysis of the scenes, the main goal of this work is clearly to demonstrate that by carefully putting together several well-established concepts and techniques in the area of computer vision, it is possible to tackle the challenging problem of smart vehicle convoying with low-cost equipment. This chapter is an abbreviation of [4], which is less terse and contains more experimental results.
2. Background
The task of tracking a single target can be divided into two parts: acquisition and tracking proper (below, simply tracking) [6]. Acquisition involves the identification and localization of the target, as well as a rough initial estimation of its pose (position and orientation), velocities and possibly some other state variables of interest. This phase is in some respects quite similar to the problem of object recognition. Usually, generality is more important at this point than in the subsequent tracking phase, because in several practical applications many different targets of interest may appear in the field of view of the tracking system, and thus it is not possible to use techniques that work only for one particular type of target. The information obtained in the acquisition phase is then used to initiate the tracking phase, in which the target's state variables are refined and updated at a relatively high frequency. In this chapter we argue that in this phase, after the target has been identified and its initial state has been properly initialized, all the specific information available about its geometry and dynamics should be exploited in the development of specialized routines that are appropriate for real-time usage and, still, require only inexpensive general-purpose hardware. As suggested by Donald Gennery [6], the tracking phase can be divided into four major subtasks: (a) Prediction: given a history of (noisy) measurements of the target state variables, extrapolate the values of these variables to the next sampling instant; (b) Projection: given the predicted state, simulate the imaging process to determine the appearance of the target in the next image; (c) Measurement: search for the expected visible features in the next image; (d) Back-projection: compute the discrepancy between the actual and the predicted image measurements and modify the estimated state accordingly (ultimately, some sort of back-projection from the 2-D image plane to the 3-D scene space is needed to perform this task). In our tracking system, one of the steps in which we exploit most heavily the availability of a priori information about the target in order to improve efficiency and accuracy is Back-projection. We make use of the fact that our target is a rigid object composed of points whose relative 3-D positions are known a priori. The problem of recovering the pose of a 3-D object from a single monocular image, given a geometrical model for this object, has been heavily studied over the last two decades or so. The solutions proposed in the literature can be classified, according to the nature of the imaging models and mathematical techniques employed, as: analytical perspective, affine and numerical perspective [3]. As explained in [4], only the latter class is appropriate for the kind of application that we have in mind. Within this class, we focus on an approach suggested recently by DeMenthon and Davis [5]. It consists of computing an initial estimate for the pose with a weak perspective camera model and then refining this model numerically, in order to account for the perspective effects in the image. The key idea is
to isolate the non-linearity of the perspective projection equations with a set of parameters that explicitly quantify the degree of perspective distortion in different parts of the scene. By artificially setting these parameters to zero, one can then generate an affine estimate for the pose. Then, the resulting pose parameters can be used to estimate the distortion parameters, and this process can be iterated until the resulting camera model (presumably) converges to full perspective. Oberkampf et al. [13] extend DeMenthon-Davis's original algorithm to deal with planar objects (the original formulation is not able to handle that particular case), and Horaud et al. [7] propose a similar approach that starts with a paraperspective rather than a weak perspective camera model. The main advantage of this kind of approach over other numerical approaches based on the use of derivatives of the imaging transformation [9, 10, 8, 14] is its efficiency. Like the derivative-based techniques, each iteration of the algorithms based on initial affine approximations demands the solution of a possibly over-constrained system of linear equations. However, in the latter methods, the coefficient matrix of this system depends only on the scene model and thus its (pseudo) inverse can be computed off-line, while the optimization-based techniques must necessarily perform this expensive operation at every single iteration [5]. However, the kind of solution mentioned so far is too general for our intended application, in which the motion of the target is roughly constrained to a unique plane. Most of the model-based pose recovery algorithms available in the literature do not impose any restriction on the possible motions of the target and thus use camera models with at least six DOF, such as the perspective, weak perspective and paraperspective models. Wiles and Brady [15] propose some simpler camera models for the important problem of smart vehicle convoying on highways. In their analysis, they assume that a camera rigidly attached to a certain trailing vehicle is used to estimate the structure of a leading vehicle, in such a way that the paths traversed by these two vehicles are restricted to a unique "ground" plane. Clearly, the application-specific constraints reduce the number of DOF in the relative pose of the leading vehicle to three. Because the camera does not undergo any rotation about its optical axis, the x axis of the camera frame can be defined to be parallel to the ground plane. Furthermore, the tilt angle between the camera's optical axis and the ground plane (α) is fixed and can be measured in advance. Thus, the general perspective camera can be specialized to a model called the perspective Ground Plane Motion (GPM) camera, whose extrinsic parameter matrix is much simpler than that of a six-DOF perspective model. We take this idea to an extreme. We not only simplify the model proposed by Wiles and Brady even further, with the assumption that the image plane is normal to the ground plane (α = 0), but we also use a specially-engineered symmetrical pyramidal target, in order to make the problem of inverting the
perspective transformation performed by the camera as simple as possible. Inspired by the work of DeMenthon and Davis [5], we adopt a solution based on the numerical refinement of an initial weak perspective pose estimate, in order to obtain accuracy at low computational cost. But rather than starting from scratch and iterating our numerical solution until it converges for each individual frame, we interleave this numerical optimization with the recursive estimation of the time series equivalent to the state of the target, as suggested by Donald Gennery [6]. So, only one iteration of the numerical pose recovery is performed per frame and the temporal coherence of the visual input stream is used in order to keep the errors in the estimates for the target state down to a sufficiently precise level.
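Read as one cycle of a tracking loop, the four subtasks of Section 2. interleave as in the following skeleton (function arguments are placeholders for the routines described in Sections 3. and 5., not the authors' code):

def tracking_cycle(state, image, predict, project, measure, back_project):
    # One cycle of the tracking phase: (a) predict, (b) project, (c) measure,
    # (d) back-project the discrepancy to correct the state estimate.
    predicted_state = predict(state)
    expected_appearance = project(predicted_state)
    observed_features = measure(image, expected_appearance)
    return back_project(predicted_state, expected_appearance, observed_features)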
3. Real-Time Pose Estimation
To some extent all computer vision is "engineered" to remove irrelevant variation that is difficult to deal with. Still, our engineering takes a fairly extreme form in that we track a specially-constructed 3-D calibration object, or target, attached to the leading robot. This target consists of two planes, parallel with respect to each other and orthogonal to the ground plane, kept in fixed positions with respect to the mobile robot by a rigid rod, as shown in Fig. 3.1(a). The plane closer to the center of the leading robot (typically further away from the camera) contains four identical circles whose centroids define a rectangle. The other plane (more distant from the leading robot) contains a unique circle whose centroid's orthogonal projection on the plane closer to the leading robot lies on the axis of vertical symmetry of the four-circle rectangle (Fig. 3.1(b)). From the point of view of our tracking algorithm, the state of this target is described with respect to a coordinate system attached to the camera, whose x and y axes correspond to the horizontal (rightward) and vertical (downward) directions on the image plane, respectively, and whose z axis corresponds to the optical axis of the camera (forward). Due to the ground-motion constraint, the target has only 3 DOF with respect to the camera. The state variables used to encode these DOF are: the distances between the camera's optical center and the centroid of the target's rectangle along the x and z axes, denoted by t_x and t_z, respectively, and the counterclockwise (as seen from the top) angle between the x axis and the plane that contains the rectangle, denoted by θ (as illustrated in Fig. 3.1(a)). At each step of the tracking phase, the tracker initially performs an a priori Prediction of the state of the target, based solely on the history of the values of its state variables. Since our mobile robots can stop and turn quite sharply, we perform this prediction with a simple velocity extrapolation for each state variable, because under these circumstances of a highly maneuverable target and rather slow update rates, more complex filtering is impractical and destabilizing.
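A minimal sketch of this velocity extrapolation, applied independently to each state variable (the dictionary keys stand for t_x, t_z and θ; equal sampling intervals are assumed):

def predict_state(previous, current):
    # Constant-velocity extrapolation: next = current + (current - previous).
    return {name: 2.0 * current[name] - previous[name] for name in current}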
Fig. 3.1. Geometry of the target placed on the posterior part of the leading vehicle: (a) top view; and (b) frontal view (labeled dimensions: w = 20.0", h = 9.8", 12.8" to the distant plane).
The predicted values for the state variables are used to compute the appearances, on the image plane, expected for the five circles that compose the target. This corresponds to the Projection step, according to the outline presented in Section 2., and amounts to projecting the known geometry of the target, according to our simplified perspective GPM camera model. Using the fact that the target is symmetrical, one can express the coordinates of the four circle centroids in the close plane as [±w/2, ±h/2, 0] and the coordinates of the centroid in the distant plane as [0, h_c, −l], where w, h and l are defined in Fig. 3.1 (for details, see [4]). Let v̂^(i) and v^(i) denote the estimated and measured values for state variable v at step i, respectively, where v is one of t_x, t_z and θ. According to our imaging model, the projection equation that yields the image coordinates of an arbitrary point i, [u_i, v_i]^T, as a function of its coordinates in the model reference frame, [x_i, y_i, z_i]^T, is:
λ [u_i, v_i, 1]^T = M_int M_ext [x_i, y_i, z_i, 1]^T,    (3.1)
where the matrix of intrinsic camera parameters, M_int (calibrated a priori), and the matrix of extrinsic camera parameters, M_ext (estimated by the tracker), are given by:
M_int = [ f_u   0    u_0
          0     f_v  v_0
          0     0    1   ],
and
M_ext = [  cos θ   0   sin θ   t_x
           0       1   0       h_0
          −sin θ   0   cos θ   t_z ].
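A sketch of the projection of eq. (3.1) with the three-DOF extrinsics written out above; the sign conventions and the fixed vertical offset h_0 are assumptions of this reconstruction, and numpy is used for brevity:

import numpy as np

def project_gpm(point_xyz, t_x, t_z, theta, f_u, f_v, u_0, v_0, h_0=0.0):
    # Perspective Ground Plane Motion camera: only 3 DOF (t_x, t_z, theta).
    M_int = np.array([[f_u, 0.0, u_0],
                      [0.0, f_v, v_0],
                      [0.0, 0.0, 1.0]])
    c, s = np.cos(theta), np.sin(theta)
    M_ext = np.array([[c,   0.0, s,   t_x],
                      [0.0, 1.0, 0.0, h_0],
                      [-s,  0.0, c,   t_z]])
    p = M_int @ M_ext @ np.append(np.asarray(point_xyz, dtype=float), 1.0)
    return p[0] / p[2], p[1] / p[2]   # image coordinates (u, v)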
These predicted appearances are then used in the Measurement phase, which corresponds to the low level processing of the current input image,
described in Section 5. Finally, the low-level image processing module returns the positions, measured in the image plane, of the apparent centroids of the target circles, which are used in the Back-projection, yielding the measured state of the target in the current step of the tracking phase. Let the apparent centroids of the top left, top right, bottom left, bottom right and central circles be denoted by [u_tl, v_tl]^T, [u_tr, v_tr]^T, [u_bl, v_bl]^T, [u_br, v_br]^T and [u_c, v_c]^T, respectively. In order to simplify the derivation of the equations that yield the measured state variables t_x, t_z and θ, we define the image measurements m_x, m_z and m_θ as follows:
m_x = (u_tl + u_tr + u_bl + u_br)/4 − u_0,   m_z = (v_bl + v_br − v_tl − v_tr)/2,   m_θ = u_c − u_0.    (3.2)
By replacing the predicted state variables in Eq. (3.1) with their measured counterparts and substituting the resulting expressions (as well as the centroid coordinates in the model frame) into Eq. (3.2), one can express each image measurement above as a function of the state variables. Each resulting expression involves at least two of the three state variables t_x, t_z and θ, as shown in [4]. In order to perform the Back-projection, we need to solve for each individual state variable. A possible approach would be to try to combine the different equations analytically, but due to the nonlinearity of the camera model, this approach is likely to result in ambiguity problems and poor error propagation properties. Instead, we exploit the temporal coherence of the sequence of images through a numerical algorithm that is iterated over successive input images, in order to recover precise values of the pose parameters. The expression which defines m_z as a function of the state variables involves only t_z and θ. Instead of trying to solve for both unknowns at the same time, we use the measured value of θ from the previous step of the tracking process in order to get the value for t_z at the current step:
t_z^(i) = ( f_v h + √( (f_v h)^2 + (m_z w sin θ^(i−1))^2 ) ) / (2 m_z).    (3.3)
Similarly, the expression for m_x can be used to solve for t_x as a function of the current value of t_z (just computed above) and the previous value of θ:
t_x^(i) = m_x t_z^(i) / f_u + ( w^2 sin θ^(i−1) ( (m_x/f_u) sin θ^(i−1) + cos θ^(i−1) ) ) / ( 4 t_z^(i) ).    (3.4)
Finally, the m_θ equation can be solved directly for θ, after the current values of t_x and t_z are both known, yielding:
θ^(i) = sin^{-1}( ( f_u t_x^(i)/t_z^(i) − m_θ ) / ( k √( f_u^2 + m_θ^2 ) ) ) + tan^{-1}( m_θ / f_u ),   where k = l / t_z^(i).    (3.5)
A careful derivation of the equations above is presented in [4].
Eqs. (3.3) to (3.5) allow one to perform pose recovery recursively, using the solution found in the previous step as an initial guess for the unknown pose at the current step. However, we still need an initial guess for θ the first time that Eqs. (3.3) and (3.4) are used. Our choice is to set θ^(0) = 0, reducing the equations mentioned above to:
t_z^(1) = f_v h / m_z,   and   t_x^(1) = m_x t_z^(1) / f_u.    (3.6)
Notice that this amounts to a weak perspective approximation, since θ = 0 implies that all four vertices of the target rectangle that is used to recover t_z and t_x are at the same depth with respect to the camera. So, in this sense, our pose recovery algorithm is inspired by the scheme proposed by DeMenthon and Davis [5], since it starts with a weak perspective approximation (at least for translation recovery) and then refines the projective model iteratively to recover a fully perspective pose. As we mentioned in Section 2., the basic differences are that we use a much more specialized camera model with only three DOF (instead of six), and we embed the refinement of the projective model in successive steps of the tracking phase, rather than starting it all over from scratch and iterating until convergence for each frame. This is a way of exploiting the temporal coherence of the input images to achieve relatively precise pose estimates at low computational cost.
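As an illustration of how the refinement is interleaved with tracking, the sketch below applies the weak perspective start-up of eq. (3.6) once and then, for each new frame, a single t_z refinement in the spirit of eq. (3.3) using the orientation from the previous step; the exact update formulas should be taken from [4], so this is only a sketch of the recursion:

import math

def init_pose(m_x, m_z, f_u, f_v, h):
    # Weak perspective start-up, eq. (3.6): the orientation is taken as zero.
    t_z = f_v * h / m_z
    t_x = m_x * t_z / f_u
    return t_x, t_z, 0.0

def refine_t_z(m_z, theta_prev, f_v, h, w):
    # One recursive refinement of t_z (cf. eq. (3.3)), using the orientation
    # estimated at the previous tracking step.
    a = f_v * h
    return (a + math.sqrt(a * a + (m_z * w * math.sin(theta_prev)) ** 2)) / (2.0 * m_z)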
4. Comparing Our Approach to the VSDF
In order to evaluate the benefits that the use of strong application-specific constraints brings to our formulation, we used synthetic data to compare it against a more generic motion recovery tool, known as the Variable State Dimension Filter (VSDF) [11, 12]. Contrary to our technique, the VSDF does not make any prior assumptions about the nature of the rigid motion in the input scenes. It is a framework for optimal recursive estimation of structure and motion based solely on the assumption that image measurement errors are independent and Gaussian distributed. The VSDF can be used with different measurement equations corresponding to distinct camera models (such as perspective [12], affine [11] or projective [12]). In the present work we focus on its affine variant (the reasons for this choice are presented in [4]). Since the type of structure recovered by the affine mode of the VSDF is non-metric, the motion recovered with it is not necessarily rigid, contrary to our own approach. So, to be fair, we augment the affine VSDF with pre- and post-processing steps aimed at ensuring rigidity, as explained in [4]. We also include in the comparison a third method, which is a simplification of our original approach for the weak perspective case [4].
In order to simulate the motion of the leading platform, we use a realistic model described in [4]. For the motion of the trailing platform, we use an "ideal" controller, that repeats the same sequence of poses (positions and orientations) of the leading platform with respect to the ground plane, with a delay of Δt frames, where Δt is a simulation parameter that is varied in some experiments and kept fixed in others. Several types of imprecisions occurring in practice are taken into account in the simulation: errors in the measured 3-D structure of the target, misalignment of the camera with respect to the ground plane, imprecisions in camera calibration, and Gaussian noise in the imaging process. For a detailed description of our experimental set-up, see [4]. Initially, we ran some experiments to determine which of the three methods under consideration is the most accurate in "general" cases. In this round of tests, Δt was varied and all the different types of imprecisions were kept fixed. The averages (across sequences of 1,800 frames) of the absolute differences between the estimated and true values for each state variable, for each method, are shown in Fig. 4.1 (standard deviations are not shown here, but are qualitatively similar; see [4]).
[Fig. 4.1 shows three panels: average X error, average Z error and average angle error, each plotted against the delay (frames) between the platforms.]
Fig. 4.1. Sensitivity in errors of the estimated pose parameters with respect to the delay between the leading and the trailing platforms. Solid, dash-dotted and dashed lines represent, respectively, planar perspective pose recovery, planar weak-perspective pose recovery and affine pose recovery with the VSDF.
We also ran tests to check the sensitivity of the individual techniques with respect to the different types of imprecisions. In these experiments Δt was kept fixed and only one type of imprecision was varied at a time. The sensitivity of the techniques with respect to the magnitude of the misalignment between the camera and the ground plane is illustrated in Fig. 4.2 (the error metrics are the same). Other sensitivity experiments are reported in [4]. The results obtained can be summarized as follows: the planar perspective technique that we suggest seems to be much more accurate than the affine mode of the Variable State Dimension Filter (VSDF) in the specific domain of our intended application, both for translation and for rotation recovery. This is not surprising, since the VSDF is a much more generic technique that does not exploit domain-specific constraints to achieve greater stability. A more interesting result is the fact that our approach's superiority can still be verified even when some of its domain-specific assumptions are partially
violated. In addition, when compared to the simpler planar approach that does not take into account the effects of perspective distortion, our technique yields significantly more accurate rotation estimates. Both planar approaches are roughly equivalent for translation estimation. We also measured averages and standard deviations of execution times for all the experiments performed. The two planar techniques have minimal computational requirements, with average execution times per frame of about 20 microseconds and standard deviations of 3 microseconds. The affine mode of the VSDF, on the other hand, can barely be used in real time, since it takes on average 51 milliseconds per frame to execute, with a standard deviation of 3 milliseconds. In practice, since the same computational resources used for pose estimation also support low-level vision and control in our implementation, the use of the VSDF would constrain the frame rate of our system to something on the order of 10 Hz, at best.
Fig. 4.2. Sensitivity in errors of the estimated pose parameters with respect to the standard deviation in the angle between the optical axis and the ground plane. Conventions are the same as in the previous figure.
5. Efficient Low Level Image Processing
In our real-world implementation, image acquisition is performed with a Matrox Meteor frame grabber. In order to achieve maximum efficiency, this device is used in a mode that reads the images directly into the memory physically addressed by the Pentia, using multiple preallocated buffers to store successive frames. This way, the digitized images can be processed directly in the memory location where they are originally stored, while the following frames are written to different locations. The initial step of the low-level image processing is the construction of a multi-resolution pyramid. In the current implementation, we start with digitized images of size 180 × 280. On each of the lower resolution levels, each image is obtained by convolving the corresponding image in the immediately higher resolution level with a Gaussian kernel and subsampling by a factor of two. This operation was implemented in a very careful way, in order to guarantee the desired real-time feasibility. Instead of using some general convolution routine that works with arbitrary kernels, we implemented a
hand-optimized function that convolves images with a specific 3 × 3 Gaussian kernel whose elements are powers of two. The use of a single predefined kernel eliminates the need to keep its elements either in specially-allocated registers or in memory, speeding up the critical inner loop of the convolution. The next step is the segmentation of the target in the image. In order to obtain some robustness with respect to variations in the illumination and in the background of the scene, we perform a histogramic analysis to determine an ideal threshold to binarize the monochromatic images grabbed by the Matrox Meteor in the trailing robot, so that the black dots in the target can be told apart from the rest of the scene (details are provided in [4]). For efficiency purposes, the grey-level frequency information needed to generate the histograms is gathered on-the-fly, during the subsampling process. The next step is to detect and label all the connected regions of low intensity (according to the selected threshold) in the image. This is done using the local blob coloring algorithm described in Ballard and Brown [1]. Initially, this algorithm is used to detect all the dark regions in a level of low resolution in the pyramid. In this phase, in addition to labeling all the connected components, we also compute, on-the-fly, their bounding boxes, centroids and masses. The dark regions detected in the image are compared against the appearances predicted for the target's circles by the tracker described in Section 3. For each predicted appearance (converted to the appropriate level of resolution), we initially label as matching candidates all the detected regions with similar mass and aspect ratio. Among these, the detected region whose centroid is closest to the position predicted by the tracker is selected as the final match for the corresponding circle in the target. The selected bounding boxes are then converted to a level of high resolution, and the blob coloring algorithm is used on each resulting window, in order to refine the precision of the estimates for the centroids. The resulting image positions are used as inputs to the tracker, which recovers the 3-D pose of the target, predicts how this pose will evolve over time, and then reprojects the 3-D predictions into the 2-D image plane, in order to calculate new predicted appearances for the black circles, which are used in the next step of the low-level digital image processing.
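A minimal sketch of the blur-and-subsample step, assuming the usual 3 × 3 binomial kernel whose weights are powers of two ((1 2 1; 2 4 2; 1 2 1)/16); this matches the description above but is not necessarily the exact kernel used:

import numpy as np

def pyramid_down(img):
    # Separable 3x3 binomial blur (weights 1/4, 1/2, 1/4 per axis), then
    # subsampling by a factor of two in each direction.
    k0, k1 = 0.25, 0.5
    padded = np.pad(img.astype(np.float32), 1, mode="edge")
    rows = k0 * padded[:, :-2] + k1 * padded[:, 1:-1] + k0 * padded[:, 2:]
    blurred = k0 * rows[:-2, :] + k1 * rows[1:-1, :] + k0 * rows[2:, :]
    return blurred[::2, ::2]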
6. Visual Control and Real-World Experiments
In addition to the tracking of the leading platform, the problem of smart convoying also requires the motion of the trailing robot to be properly controlled, so that the target never disappears from its field of view (or, alternatively, is reacquired whenever it disappears). In our system, this control is based on the 30 Hz error signals corresponding to the values recovered for t_x and t_z (θ is used only in the prediction of the appearance of the target in the next frame), and also on odometry data.
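The two high-level 30 Hz loops described below can be sketched as follows; the PID form is textbook, and the gains and the desired following distance are placeholder values, not the empirically tuned ones:

class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.previous_error = 0.0

    def step(self, error):
        self.integral += error * self.dt
        derivative = (error - self.previous_error) / self.dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

DT = 1.0 / 30.0                                   # high-level loop rate
speed_pid = PID(kp=0.8, ki=0.05, kd=0.1, dt=DT)   # regulates t_z to the desired range
steer_pid = PID(kp=1.2, ki=0.0, kd=0.2, dt=DT)    # drives t_x toward zero

def high_level_commands(t_x, t_z, t_z_desired):
    # Ideal speed and steering commands, later handed to the 100 Hz wheel-level loops.
    speed_command = speed_pid.step(t_z - t_z_desired)
    steer_command = steer_pid.step(-t_x)
    return speed_command, steer_command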
The use of odometry is important because the true dynamics of our mobile platforms is quite complex. The two motors are directly driven by PCM signals generated by a proprietary controller which interprets the output of either an Onset M68332 micro-controller or a joystick (for manual control). So, all that the M68332 sees is an abstraction of the motors. It can issue commands to change the velocity or the steering direction of the platform, but it cannot control the wheels individually. The true effects of the issued commands depend on a number of factors that are difficult to model exactly, such as differences in the calibration of the motor torques and the relative orientations between the two wheels and a set of three passive casters that are used to create a stable set of contact points between the platform and the ground. Because of these imprecisions, an open-loop sequence of "go-straight" commands issued by the M68332 can actually make the platform move along a curved trajectory, for instance. In order to overcome this difficulty, we use data obtained from two bidirectional hollow shaft encoders [2] (one for each wheel) to close the loop so as to guarantee that the angular velocities on the two wheel axes actually correspond to the desired motion patterns. Each encoder generates two square waves, with a ninety-degree phase lag. These output signals are decoded by customized hardware and used to decrement or increment a specific register in the M68332 each time the corresponding wheel rotates back or forth, respectively, by an arc equivalent to the precision of the shaft encoder. Variations in these registers are then compared with the desired values for the angular velocities of the wheels, so as to create error signals that are fed back to the M68332 controller. On the other hand, the M68332 by itself does not provide enough computational power to process high-bandwidth signals, such as visual data, in real time. So, we augmented the system with twin 133 MHz Pentium processors, which are used to process the digitized image sequences so as to extract the desired visual measurements and estimate motion (as explained in Sections 4. and 5.). This set-up naturally leads to a two-level control strategy. In the Pentium processors, a higher level composed of two low-frequency (30 Hz) PID controllers (with proportional, integral and derivative gains empirically set) converts the t_z and t_x signals, respectively, into ideal speed and steering commands for the platform. The goal of one of these controllers is to keep t_z equal to a convenient predefined value, while the other aims at keeping t_x equal to zero. These commands are then passed down to a lower level that runs at 100 Hz in the M68332. This level also uses two PID controllers with empirically-set gains. One of them uses differences in the rates of change of the tick counters for the two wheels to stabilize steering, while the other uses the average of these rates of change to stabilize the velocity. So, from the point of view of the higher level, this lower level creates an abstraction of the dynamics of the mobile platform that is much simpler than reality, since
the unpredictable effects of several imprecisions are compensated through the use of odometry. The communication between the Pentium board and the M68332 is performed with a specialized serial protocol whose design and implementation are described in [2]. In order to evaluate our approach for convoying, we ran some experiments with real data. Basically, we used the methodology described so far to try to make one of our two identical mobile platforms follow the other (manually driven) at a roughly constant distance of about 5 feet. These experiments were performed in indoor environments with varying lighting conditions. It was verified that the controller performs quite well in the sense that it manages to keep the leading platform in view, actually keeps the distance roughly constant, tolerates changes in lighting conditions, and can reliably track turns of up to 180 degrees without losing target features, as illustrated by the sequence of Fig. 6.1.
Fig. 6.1. Nine frames from a sequence that shows a completely autonomous mobile platform following a manually-driven platform around a cluster of tables. The temporal sequence of the frames corresponds to a row-major order.
7. Conclusions
These results support our position that by putting together traditional computer vision techniques, carefully customized to meet application-specific needs, it is possible to tackle challenging problems with low-cost
off-the-shelf hardware. In the specific case of convoying, we have shown, in a careful evaluation with synthetic data, that specialized motion analysis algorithms that take into account domain-specific constraints, such as the existence of a unique ground plane, often yield more accurate and stable results than totally generic techniques, even when these assumptions are only partially met. Finally, we suggested a two-level approach for control, in which high-frequency odometry data is used to stabilize visual control. This chapter describes work that is still in progress, and we stress the fact that some of the issues raised here need further investigation. In our opinion, one of the most interesting directions in which this work must be continued is with a deeper investigation of which control strategy is best for the application at hand. Our current controller assumes "off-road" conditions: it is permissible always to head directly at the lead vehicle, thus not necessarily following its path. If vehicles must stay "on road", the follower may be forced to re-trace the trajectory of the leader precisely. State estimates of the leader's heading (global steering angle, say) as well as speed (or accelerations) are ultimately needed, to be duplicated for local control. Vision becomes harder since the follower cannot always aim itself at the leader. The desired trajectory is known, which turns the problem into one that can perhaps more usefully be related to optimal control than to simple feedback control.
Acknowledgement. This chapter is based on work supported by CAPES process BEX-0591/95-5, NSF IIP grant CDA-94-01142, NSF grant IRI-9306454 and DARPA grant DAAB07-97-C-J027.
References
1. D. H. Ballard and C. M. Brown. Computer Vision. Prentice-Hall, 1982.
2. J. D. Bayliss, C. M. Brown, R. L. Carceroni, C. K. Eveland, C. Harman, A. Singhal, and M. V. Wie. Mobile robotics 1997. Technical Report 661, U. Rochester Comp. Sci. Dept., 1997.
3. R. L. Carceroni and C. M. Brown. Numerical methods for model-based pose recovery. Technical Report 659, U. Rochester Comp. Sci. Dept., 1997.
4. R. L. Carceroni, C. Harman, C. K. Eveland, and C. M. Brown. Design and evaluation of a system for vision-based vehicle convoying. Technical Report 678, U. Rochester Comp. Sci. Dept., 1998.
5. D. F. DeMenthon and L. S. Davis. Model-based object pose in 25 lines of code. Int. J. of Comp. Vis., 15:123-141, 1995.
6. D. B. Gennery. Visual tracking of known three-dimensional objects. Int. J. Comp. Vis., 7(3):243-270, 1992.
7. R. Horaud, S. Christy, F. Dornaika, and B. Lamiroy. Object pose: Links between paraperspective and perspective. In Proc. Int. Conf. Comp. Vis., pages 426-433, 1995.
8. Y. Liu, T. S. Huang, and O. D. Faugeras. Determination of camera location from 2-D to 3-D line and point correspondences. IEEE Trans. PAMI, 12(1):28-37, 1990.
9. D. G. Lowe. Three-dimensional object recognition from single two-dimensional images. Artif. Intell., 31(3):355-395, 1987.
10. D. G. Lowe. Fitting parameterized three-dimensional models to images. IEEE Trans. PAMI, 13(5):441-450, 1991.
11. P. F. McLauchlan and D. W. Murray. Recursive affine structure and motion from image sequences. In Proc. European Conf. on Comp. Vis., pages 217-224, 1994.
12. P. F. McLauchlan and D. W. Murray. A unifying framework for structure and motion recovery from image sequences. In Proc. IEEE Int. Conf. on Comp. Vis., pages 314-320, 1995.
13. D. Oberkampf, D. F. DeMenthon, and L. S. Davis. Iterative pose estimation using coplanar feature points. Comp. Vis. Image Understanding, 63(3):495-511, 1996.
14. T. Q. Phong, R. Horaud, and P. D. Tao. Object pose from 2-D to 3-D point and line correspondences. Int. J. Comp. Vis., 15:225-243, 1995.
15. C. Wiles and M. Brady. Ground plane motion camera models. In Proc. European Conf. on Comp. Vis., volume 2, pages 238-247, 1996.
Visual Routines for Vehicle Control
Garbis Salgian and Dana H. Ballard
Computer Science Department, University of Rochester, Rochester, NY 14627 USA
1. Introduction
Automated driving holds the promise of improving traffic safety, alleviating highway congestion and saving fuel. The continuous increase in processor speed over the last decade has led to an increased effort in research on automated driving in several countries [1]. However, autonomous tactical-level driving (i.e. having the ability to perform traffic maneuvers in complex, urban-type environments) is still an open research problem. As little as a decade ago, it was widely accepted that the visual world could be completely segmented into identified parts prior to analysis. This view was supported in part by the belief that additional computing cycles would eventually be available to solve this problem. However, the complexity of vision's initial segmentation can easily be unbounded for all practical purposes, so that the goal of determining a complete segmentation of an individual scene in real time is impractical. Thus, to meet the demands of ongoing vision, the focus has shifted to a more piecewise and on-line analysis of the scene, wherein just the products needed for behavior are computed as needed. Such products can be computed by visual routines [14], special-purpose image processing programs that are designed to compute specific parameters that are used in guiding the vehicle. This paper describes the development and testing of visual routines for vehicle control. It addresses the generation of visual routines from images using appearance-based models of color and shape. The visual routines presented here are a major component of the perception subsystem of an intelligent vehicle. The idea of visual routines is compelling owing to the fact that, being special-purpose, vast amounts of computation can be saved. For this reason they have been used in several simulations (e.g. [9]), but so far they have been used in image analysis only in a few restricted circumstances.
2. Photo-realistic Simulation
Autonomous driving is a good example of an application where it is necessary to combine perception (vision) and control. However, testing such a system in the real world is difficult and potentially dangerous, especially in complex dynamic environments such as urban traffic.
Fig. 2.1. The graphical output of the simulator is sent to the real-time image processing hardware (Datacube color digitizer and MV200 processing board), which is connected to a host computer. The host analyzes the incoming images and sends back to the simulator controls for the vehicle and virtual camera.
Given recent advances in computer graphics, both in terms of the quality of the generated images and the rendering speed, we believe that a viable alternative to initial testing in the real world is provided by integrating photo-realistic simulation and real-time image processing. This allows testing the computer vision algorithms under a wide range of controllable conditions, some of which would be too dangerous to reproduce in an actual car. The resulting testbed supports rapid prototyping. Terzopoulos and Rabie pioneered the use of simulated images in their animat vision architecture; however, in their approach all the processing is carried out in software, one of the motivations for the architecture being that it avoids the difficulties associated with "hardware vision" [13]. In our case, the graphical output from the simulator is sent to a separate subsystem (host computer with pipeline video processor) where the images are analyzed in real time and commands are sent back to the simulator (figure 2.1). The images are generated by an SGI Onyx Infinite Reality engine which uses a model of a small town and the car. Visual routines are scheduled to meet the temporary task demands of individual driving sub-problems such as stopping at lights and traffic signs. The output of the visual routines is used to control the car, which in turn affects the subsequent images. In addition to the simulations, the routines are also tested on similar images generated by driving in the real world to assure the generalizability of the simulation. The simulator can also be used with human subjects, who can drive a kart through the virtual environment while wearing head-mounted displays (HMD). A unique feature of our driving simulator is the ability to track eye movements within a freely moving VR helmet, which allows us to explore the scheduling tradeoffs that humans use. This provides a benchmark for the automated driver and is also a source of ideas as to the priorities assigned by the human driver. In particular, the fixation point of the eyes at any moment is an indicator of the focus of attention for the human operator. Experiments show that this fixation point can be moved at the rate of one fixation every 0.3
to 1 second. Studying the motion of this fixation point provides information on how the human driver is allocating resources to solve the current set of tactical driving-related problems.
3. Perceptual and Control Hierarchy
The key problem in driving at a tactical level is deciding what to attend to. In our system this problem is mediated by a scheduler, which decides which set of behaviors to activate. The central thesis is that, at any moment, the demands of driving can be met by a small number of behaviors. These behaviors, in turn, invoke a library of visual routines. Recent studies have shown that humans also switch among simple behaviors when solving more complex tasks [4]. The hierarchy of perception and control levels that forms the framework for driving is presented in figure 3.1. While not exhaustive, all the relevant levels in implementing a behavior are represented. At the top, a scheduler selects from a set of task-specific behaviors the one that should be activated at any given moment. The behaviors use visual routines to gather the information they need and act accordingly. Finally, the visual routines are composed from an alphabet of basic operations (similar to Ullman's proposal [14]). The hierarchy in figure 3.1 has many elements in common with that of Maurer and Dickmanns [6]. The main difference is one of emphasis: we have focused on the role of perception in vehicle control, seeking to compute task-related invariants that simplify individual behaviors. To illustrate how modules on different levels in the hierarchy interact, consider the case when the scheduler activates the stop sign behavior. To determine whether there is a stop sign within a certain range, the stop sign detection visual routine is invoked. The routine in its turn uses several basic operations to determine if a stop sign is present. For instance, it uses color to reduce the image area that needs to be analyzed. If there are any red blobs, they are further verified to see if they represent stop signs by checking distinctive image features. Finally, the routine returns to its caller with an answer (stop sign found or not). If no stop sign was detected, that information is passed to the scheduler, which can then decide what behavior to activate next. On the other hand, if a stop sign was found, the agent has to stop at the intersection to check for traffic. For that, it needs to know where the intersection is, so the intersection detection routine will be activated. It can use static image features (e.g., lines, corners) to determine where in the image the intersection is located. At the behavior level this information can be used for visual servoing until the intersection is reached. The shaded modules in figure 3.1 are the ones that have been implemented so far. Road following has been intensely studied for more than a decade [1], [7] and it was successfully demonstrated at high speeds and over extended
distances. Therefore we decided not to duplicate these efforts initially and instead to take advantage of the simulated environment. In our experiments the car moves on a predefined track and the driving program controls the acceleration (the gas and brake pedals).
Fig. 3.1. Hierarchy of perceptual and control modules. At the top level, the scheduler selects the behavior that is currently active. This behavior uses one or more visual routines to gather the information it needs to take the appropriate decisions. The routines are composed from a set of low-level basic operations. Shaded modules are the ones currently implemented.
3.1 Basic Operations
At the lowest level in the hierarchy are basic operations. These are simple low-level functions which can be used in one or more of the higher-level task-specific visual routines. The implementation uses special real-time image processing hardware, namely two Datacube boards: one is a color digitizer (Digicolor) and the other is the main processing board (MV200).
Color. The role of the color primitive is to detect blobs of a given color. An incoming color image is digitized in the Hue, Saturation, Value color space. Colors are defined as regions in the hue-saturation sub-space, and a lookup table is programmed to output a color value for every hue-saturation input pair. A binary map corresponding to the desired color is then extracted and analyzed using a blob labeling algorithm. The end result is a list of bounding rectangles for the blobs of that color.
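The color primitive above is implemented on the Datacube hardware, but the same idea is easy to prototype in software. The sketch below is not from the original system; it uses OpenCV's HSV conversion and connected-component labeling, and the hue/saturation bounds and minimum blob area are illustrative assumptions.

```python
import cv2
import numpy as np

def find_color_blobs(bgr_image, hue_range=(0, 10), sat_min=100, min_area=50):
    """Return bounding rectangles (x, y, w, h) of blobs whose hue and
    saturation fall inside a rectangular region of the hue-saturation
    sub-space. The bounds and minimum area are illustrative assumptions;
    OpenCV hue runs 0-179, so a second range near 170-179 would be needed
    to cover the full red wrap-around."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    lower = np.array([hue_range[0], sat_min, 0], dtype=np.uint8)
    upper = np.array([hue_range[1], 255, 255], dtype=np.uint8)
    mask = cv2.inRange(hsv, lower, upper)              # binary map for the color
    n, _, stats, _ = cv2.connectedComponentsWithStats(mask)
    boxes = []
    for i in range(1, n):                              # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:
            boxes.append((int(x), int(y), int(w), int(h)))
    return boxes
```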
Static features. The role of the static feature primitive is to detect objects of a specific appearance. It uses steerable filters, first proposed by Freeman and Adelson [3], who showed how a filter of arbitrary orientation and phase can be synthesized from a set of basis filters (oriented derivatives of a two-dimensional, circularly symmetric Gaussian function). Other researchers have used these filters for object identification [8]. The idea is to create a unique index for every image location by convolving the image at different spatial resolutions with filters from the basis set. If M filters are applied to the image at N different scales, an M x N element response vector is generated for every image position. For appropriate values of M and N, the high dimensionality of the response vector ensures its uniqueness for points of interest. Searching for an object in an image is realized by comparing the index of a suitable point on the model with the index for every image location. The first step is to store the index (response vector) r_m for the chosen point on the model object. To search for that object in a new image, the response r_i at every image point is compared to r_m and the one that minimizes the distance d_im = ||r_i - r_m|| is selected, provided that d_im is below some threshold. More details about the color and static feature primitives and their real-time implementation on the Datacube hardware are given in [10].
Dynamic features. The goal of this primitive is to detect features that expand or contract in the visual field. The primitive combines three separate characteristics. Each of these has been explored independently, but our design shows that there are great benefits when they are used in combination, given the particular constraints of the visual environment during driving. The first characteristic is the special visual structure of looming itself: in driving, closing or losing ground with respect to the vehicle ahead creates an expansion or contraction of the local visual field with respect to the point of gaze [5]. The second is that the expansion and contraction of the visual field can be captured succinctly with a log-polar mapping of the image about the gaze point [11]. The third is that the looming is detected by correlating the responses of multiple oriented filters. Starting from I_t, the input image at time t, the first step is to create LP_t, the log-polar mapping at time t. This is done in real time on the pipeline video processor using the miniwarper, which allows arbitrary warps. Since dilation from the center in the original image becomes a shift in the new coordinates, detecting looming in the original input stream I_t translates into detecting horizontal shifts in the stream of transformed images LP_t, with 0 <= t < t_max.
Another reason for using a log-polar mapping is that the space-variant sampling emphasizes features in the foveal region while diminishing the influence of those in the periphery of the visual field (figure 3.3, left). This is useful in the car following scenario, assuming fixation is maintained on the leading vehicle, since it reduces the chance of false matches in the periphery. The Dynamic Feature Map (DFM) indicates the regions in the image where a specified shift is present between LP_t and LP_{t-1}. DFM_{s,t} denotes the map at time t with a shift value s and is obtained by correlating LP_t^s with LP_{t-1}^0 (where the superscript indicates the amount of shift).
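A minimal software sketch of the dynamic feature primitive is given below. It uses OpenCV's log-polar warp in place of the Datacube miniwarper, and a thresholded absolute difference stands in for the correlation; as noted in the following paragraph, the real system correlates five oriented-filter responses rather than raw intensity. The window size, shift and threshold are illustrative assumptions.

```python
import cv2
import numpy as np

def log_polar(gray, center, max_radius, dsize=(128, 128)):
    """Log-polar mapping about the gaze point. In cv2.warpPolar the radius
    runs along the columns, so a dilation about the center becomes a
    rightward (horizontal) shift, as described above."""
    return cv2.warpPolar(gray.astype(np.float32), dsize, center, max_radius,
                         cv2.WARP_POLAR_LOG + cv2.INTER_LINEAR)

def dynamic_feature_map(lp_t, lp_prev, shift, thresh=10.0):
    """DFM_{s,t}: mark locations where LP_t, shifted horizontally by -shift,
    matches LP_{t-1}, i.e. where the scene expanded (shift > 0) or
    contracted (shift < 0) by `shift` log-radius bins."""
    aligned = np.roll(lp_t, -shift, axis=1)
    return (np.abs(aligned - lp_prev) < thresh).astype(np.int8)

def looming_map(lp_t, lp_prev, s=2):
    """DFM_t = DFM_{s,t} - DFM_{-s,t}, anticipating the looming routine of
    section 3.2: regions consistent with both shifts cancel, and the sign of
    the surviving blobs indicates dilation or contraction."""
    return dynamic_feature_map(lp_t, lp_prev, s) - dynamic_feature_map(lp_t, lp_prev, -s)
```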
In order to reduce the number of false matches, the correlation is performed in a higher-dimensional space by analyzing the responses of five different filters (from the same basis set as in the static feature case).
3.2 Visual Routines
Fig. 3.2. Stop sign detection routine.
Basic operations are combined into more complex, task-specific routines. Since the routines are task-specific, they make use of high-level information (e.g., a geometric road model, known ego-motion, etc.) to limit the region of the image that needs to be analyzed, which leads to reduced processing time. We have implemented routines for stop light, stop sign and looming detection.
Stop light detection. The stop light detection routine is an application of the color blob detection primitive to a restricted part of the image. Specifically, it searches for red blobs in the upper part of the image. If two red blobs are found within the search area, then a stop light is signaled. Currently, the search window is fixed a priori. Once we have a road detection routine, we will use that information to adjust the position and size of the window dynamically.
Stop sign detection. The area searched for stop signs is the one on the right side of the road (the white rectangle in the right side of every image in figure 3.2). First, the color primitive is applied to detect red blobs in this area, which are candidates for stop signs. Since other red objects can appear in this region (such as billboards, brick walls, etc.), the color test alone is not enough for detecting the stop signs; it is used just as a "focus of attention" mechanism to further limit the image area that is analyzed. Once a red blob is detected, the static feature primitive is applied to determine whether any of the filter responses r_{x,y} in that area (dashed white
rectangle) matches the previously stored response for a stop sign, r_m. If the error (difference) is below some predetermined threshold, a stop sign is reported. The two routines have been tested both in simulation and on real-world video sequences. Sample results are presented in [10].
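A software sketch of the static feature primitive used above is shown below. Derivative-of-Gaussian filters from SciPy stand in for the steerable basis set run on the Datacube hardware; the particular derivative orders, scales and the matching threshold are illustrative assumptions, not the values used in the system.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

ORDERS = ((0, 1), (1, 0), (0, 2), (2, 0), (1, 1))   # M = 5 Gaussian derivatives
SCALES = (1.0, 2.0, 4.0)                             # N = 3 spatial scales

def response_stack(image):
    """M x N filter responses at every pixel, stacked into a (M*N, H, W) array."""
    img = image.astype(np.float64)
    return np.stack([gaussian_filter(img, sigma=s, order=o)
                     for s in SCALES for o in ORDERS])

def model_index(model_image, x, y):
    """Store the response vector r_m for a chosen point on the model object."""
    return response_stack(model_image)[:, y, x]

def match_model(image, r_m, threshold):
    """Find the image location whose index r_i minimizes d_im = ||r_i - r_m||;
    report a detection only if that distance is below the threshold."""
    d = np.linalg.norm(response_stack(image) - r_m[:, None, None], axis=0)
    y, x = np.unravel_index(np.argmin(d), d.shape)
    return (x, y) if d[y, x] < threshold else None
```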
Fig. 3.3. The overall looming detection routine combines the results of two dynamic feature maps, one using a positive shift, another using a negative shift.
Looming detection. The looming detection routine applies two instances of the dynamic feature primitive (for two equal shifts of opposite signs) on consecutive frames in log-polar coordinates. Figure 3.3 illustrates the main steps and some intermediate results for the case when the leading car is approaching (expanding from I_{t-1} to I_t). Consequently, features on the car shift to the right from LP_{t-1} to LP_t and show up in DFM_{s,t} but not in
DFM_{-s,t}. A single dynamic feature map DFM_t is computed as the difference of DFM_{s,t} and DFM_{-s,t}. By taking the difference of the two maps, the sensitivity to speeds in the region where the distributions for s and -s overlap is reduced. This is visible in figure 3.3, where features from the building in the background are present in both DFM_{s,t} and DFM_{-s,t}, but cancel each other in DFM_t, which contains only the features corresponding to the car. DFM_t is analyzed for blobs and the list of blobs (with size, centroid and sign) is returned. The sign indicates whether it is a dilation or a contraction. If there is more than one blob, the correspondence is determined across frames based on the distance between them. The tracking can be further simplified by analyzing only a sub-window of the dynamic feature map corresponding to a region of interest in the original image (e.g., the lower part if the shadow under the lead vehicle is tracked).
3.3 Driving Behaviors
Visual routines are highly context dependent, and therefore need an enveloping construct to interpret their results. For example, the stop light detector assumes the lights are in a certain position when gaze is straight ahead, so the stop light behavior has to enforce this constraint. To do this, the behaviors are implemented as finite state machines, presented in figure 3.4.
Traffic light behavior. The initial state is "Look for stop lights", in which the traffic light detection routine is activated. If no red light is detected the behavior returns immediately. When a red light is detected, the vehicle is instructed to stop and the state changes to "Wait for green light", in which the red light detector is executed. When the light changes to green, the routine will return "No red light", at which time the vehicle starts moving again and the behavior completes.
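The traffic light state machine of figure 3.4 A can be written in a few lines. In the sketch below, detect_red_light, stop_vehicle and resume_vehicle are assumed callables wrapping the visual routine and the vehicle interface; they are not names from the original system.

```python
def traffic_light_behavior(detect_red_light, stop_vehicle, resume_vehicle):
    """One invocation of the traffic light behavior (figure 3.4 A).
    detect_red_light() wraps the stop light visual routine and returns True
    while a red light is visible; the other two callables command the car."""
    state = "Look for stop lights"
    while True:
        if state == "Look for stop lights":
            if not detect_red_light():
                return                      # no red light: return immediately
            stop_vehicle()
            state = "Wait for green light"
        elif state == "Wait for green light":
            if not detect_red_light():      # routine reports "no red light"
                resume_vehicle()
                return                      # the behavior completes
```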
Fig. 3.4. Finite state machines used for two driving behaviors (A: traffic light behavior; B: stop sign behavior).
Stop sign behavior. In the "Look for stop signs" state the stop sign detection routine is activated. If no sign is detected the behavior returns immediately. When a stop sign is detected, the agent needs to stop at the intersection. Since we do not have an intersection detector yet, once the stop sign is detected the state changes to "Track stop sign", in which the vehicle moves forward while tracking the sign. When the sign is no longer visible, a new state is entered in which the agent stops and pans the camera left and right.
Car following behaviors. The looming detection routine can be used to build a car following behavior. Two such behaviors are presented, one purely reactive and another that tries to maintain a constant distance to the leading vehicle.
Reactive behavior. This behavior does not model the motion of the leading vehicle. It has a default speed v_def, at which the vehicle moves if nothing is detected by the looming routine. When there is something looming in front of the vehicle, the routine returns the horizontal coordinate of the corresponding blob centroid in the DFM and its sign. Based on these two inputs, the desired
speed v_des is computed to ensure that maximum braking is applied when the leading vehicle is close and approaching, and maximum acceleration is applied when the distance to the lead vehicle is large and increasing. The actual vehicle speed is determined by the current speed, the desired speed and the vehicle dynamics.
Constant distance behavior. This behavior tries to maintain a constant distance to the leading vehicle by monitoring the position of the blob centroid x_c in the dynamic feature map. The desired relative distance is specified by the corresponding horizontal position in log-polar coordinates, x_des. The error signal x_err = x_des - x_c is used as input to a proportional plus integral (PI) controller whose output is the desired vehicle speed.
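A sketch of the constant distance behavior is given below. It implements the PI law on the centroid error x_err = x_des - x_c described above; the gains, sample period and speed limit are illustrative assumptions rather than the values used in the experiments.

```python
import numpy as np

class ConstantDistanceBehavior:
    """PI controller on the blob-centroid column of the dynamic feature map;
    its output is the desired vehicle speed. Since x_c grows as the lead car
    looms larger (gets closer), positive gains slow the vehicle down when it
    is too close. Gains, limits and dt are assumptions, not measured values."""
    def __init__(self, x_des, kp=1.5, ki=0.4, dt=0.1, v_max=70.0):
        self.x_des, self.kp, self.ki, self.dt = x_des, kp, ki, dt
        self.v_max, self.integral = v_max, 0.0

    def desired_speed(self, x_c):
        x_err = self.x_des - x_c                  # x_err = x_des - x_c
        self.integral += x_err * self.dt          # the integral term lets the
        v_des = self.kp * x_err + self.ki * self.integral   # speed settle near
        return float(np.clip(v_des, 0.0, self.v_max))       # the lead's speed
```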
3.4 Scheduling
Given a set of behaviors and a limited processing capacity, the next problem to address is how to schedule them in order to ensure that the right behavior is active at the right time. This issue has been addressed by other researchers and several solutions have been proposed: inference trees [9] and, more recently, distributed architectures with centralized arbitration [12]. We are currently investigating different alternatives for the scheduler design. So far our principal method is to alternate between the existing behaviors, but there are important subsidiary considerations. One is that the performance of difficult or important routines can be improved by scheduling them more frequently. Another is that the performance of such routines can be further improved by altering the behavior, for example by slowing down. The effect of different scheduling policies is addressed in [10].
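A weighted round-robin, sketched below, is one simple way to run difficult or important routines more frequently while still alternating among all behaviors; the behavior names and weights are illustrative assumptions, not the policy studied in [10].

```python
import itertools

def make_schedule(behaviors_with_weights):
    """Weighted round-robin: a behavior with weight k is run k times per
    cycle, so it is scheduled more frequently than the others."""
    cycle = [name for name, k in behaviors_with_weights for _ in range(k)]
    return itertools.cycle(cycle)

# Example: give the traffic light behavior twice the share of the others.
schedule = make_schedule([("traffic light", 2),
                          ("stop sign", 1),
                          ("car following", 1)])
# next(schedule) -> "traffic light", "traffic light", "stop sign", "car following", ...
```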
4. Experiments
The two car following behaviors have been tested in simulation. The leading vehicle was moving at a constant speed of 48 km/h and the initial distance between vehicles was around 20 meters. For the reactive case, the default speed was set to 58 km/h. The results are shown in the center column of figure 4.1: the upper plot is the vehicle speed, and the lower one is the relative distance. The reactive characteristic of the behavior is noticeable in the speed profile, which has a seesaw pattern. The distance to the leading car varies significantly, which is to be expected since the controller has no model of the relative position of the vehicles. In the case of the constant distance behavior, the desired position was set initially to correspond to a relative distance of about 20 meters, and after 10 seconds it was changed to a relative distance of about 11 meters. The upper right plot in figure 4.1 shows the speed profile of the vehicle, which is closer to the speed of the leading vehicle than in the reactive case. Also, the relative
distance (lower right plot) varies significantly less around the desired value. The response to the step change is relatively slow (about 10 seconds to reach the new relative distance), but this is determined by the parameters of the controller. We have not extensively experimented with the possible parameter values, the main focus so far being to show that the looming detection routine provides a robust enough signal that can be used in a car following behavior.
Fig. 4.1. Vehicle speed (up) and distance between vehicles (down) for a human driver (left) and two instances of the robot driver with different car following behaviors: reactive (center) and constant distance (right). The results show that the former has an absolute error of about 5 meters and the latter about 1.5 meters.
The leftmost column in figure 4.1 shows the results for a human driving in the same virtual environment. The fact that humans also exhibit a characteristic sawtooth pattern in speed change may suggest that they rely on the looming cue for car following (as opposed to using other image cues to estimate the relative distance). The tests here have assumed the functioning of the tracking system that can identify the rough position of the lead vehicle during turns. This information is in the optic flow of the dilation and contraction images, in that vertical motion of the correlation images indicates turns. Figure 4.2 shows the real angular offset (dotted line) and the value recovered from the vertical position of the blob corresponding to the lead vehicle in the dynamic feature map (solid line). The right side shows the same data, after removing the lag. Our future plan is to use the measured angular offset to control the panning of the virtual camera in order to maintain fixation on the lead vehicle when it turns.
Fig. 4.2. Left: Angular offset of the lead vehicle in the visual field of the follower for a road segment with two turns (the camera is looking straight ahead); Right: Same data, after removing the lag due to rendering and image processing.
5. Conclusions
Driving is a demanding, dynamically changing environment that places severe temporal demands on vision. These demands arise from the need to do a variety of things at once. One way to meet them is to use specially designed visual behaviors that are geared to sub-parts of the task. Such visual behaviors in turn use visual routines that are specialized to location and function. Our hypothesis is that:
1. the complex behavior of driving can be decomposed into a large library of such behaviors;
2. at any moment the tactical demands of driving can be met by special purpose hardware that runs just a collection of these behaviors; and
3. the appropriate subset of such routines can be selected by a scheduling algorithm that requires a much slower temporal bandwidth.
We demonstrated this design by implementing three such behaviors: a stop sign behavior, a traffic light behavior and a car following behavior. All three take advantage of special purpose video pipeline processing to execute in approximately 100 milliseconds, thus allowing real-time behavior. The tests of the looming behavior show that it is extremely robust and is capable of following cars over a wide range of speeds and following distances. The obvious alternative strategy for car following would be to track points on the lead car. This has been tried successfully [2] but requires that the tracker identify points on the vehicle over a wide variety of illumination conditions. In contrast, the method herein does not require that the scene be segmented in any way. It only requires that the visual system can track the lead vehicle
during turns and that the relative speeds between them are slower than their absolute speeds. As of this writing, the various behaviors have only been tested under simple conditions. Future work will test the robustness of the scheduler under more complicated driving scenarios where the demands of the visual behaviors interact. One such example is that of following a car while obeying traffic lights. The demonstration system is a special design that allows the output of a Silicon Graphics Onyx Infinite Reality to be sent directly to the video pipeline computer. The results of visual processing are then sent to the car model and appropriate driving corrections are made. This design is useful for rapid prototyping, allowing many situations to be explored in simulation that would be dangerous, slow or impractical to explore in a real vehicle.
Acknowledgement. This research was supported by NIH/PHS research grant 1-P41RR09283.
References
1. E. D. Dickmanns. Performance improvements for autonomous road vehicles. In Proceedings of the 4th International Conference on Intelligent Autonomous Systems, pages 2-14, Karlsruhe, Germany, March 27-30 1995.
2. U. Franke, F. Böttiger, Z. Zomotor, and D. Seeberger. Truck platooning in mixed traffic. In Proceedings of the Intelligent Vehicles '95 Symposium, pages 1-6, Detroit, USA, September 25-26 1995.
3. W. T. Freeman and E. H. Adelson. The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(9):891-906, September 1991.
4. M. F. Land and S. Furneaux. The knowledge base of the oculomotor system. Sussex Centre for Neuroscience, School of Biological Sciences, University of Sussex, Brighton BN1 9QG, UK, 1996.
5. D. N. Lee. A theory of visual control of braking based on information about time-to-collision. Perception, 5:437-459, 1976.
6. M. Maurer and E. Dickmanns. An advanced control architecture for autonomous vehicles. In Navigation and Control Technologies for Unmanned Systems II, volume 3087 of SPIE, pages 94-105, Orlando, FL, USA, 23 April 1997.
7. D. Pomerleau. Ralph: rapidly adapting lateral position handler. In Proceedings of the Intelligent Vehicles '95 Symposium, pages 506-511, New York, NY, USA, September 1995. IEEE.
8. R. P. Rao and D. H. Ballard. Object indexing using an iconic sparse distributed memory. In ICCV-95, pages 24-31, June 1995.
9. D. A. Reece. Selective perception for robot driving. Technical Report CMU-CS-92-139, Carnegie Mellon University, 1992.
10. G. Salgian and D. H. Ballard. Visual routines for autonomous driving. In Proceedings of the 6th International Conference on Computer Vision (ICCV-98), pages 876-882, Bombay, India, January 1998.
11. E. L. Schwartz. Anatomical and physiological correlates of visual computation from striate to infero-temporal cortex. IEEE Transactions on Systems, Man and Cybernetics, SMC-14(2):257-271, April 1984.
12. R. Sukthankar. Situation Awareness for Tactical Driving. PhD thesis, Robotics Institute, CMU, Pittsburgh, PA 15213, January 1997. CMU-RI-TR-97-08.
13. D. Terzopoulos and T. F. Rabie. Animat vision: Active vision in artificial animals. In ICCV-95, pages 801-808, June 1995.
14. S. Ullman. Visual routines. Cognition, 18:97-160, 1984.
Microassembly of Micro-electro-mechanical Systems (MEMS) using Visual Servoing
John Feddema and Ronald W. Simon
Sandia National Laboratories, P.O. Box 5800, MS 1003, Albuquerque, NM 87185
Summary. This paper describes current research and development on a robotic visual servoing system for assembly of MEMS (Micro-Electro-Mechanical Systems) parts. The workcell consists of an AMTI robot, precision stage, long working distance microscope, and LIGA (Lithography, Galvanoforming, Abforming) fabricated tweezers for picking up the parts. Fourier optics methods are used to generate synthetic microscope images from CAD drawings. These synthetic images are used off-line to test image processing routines under varying magnifications and depths of field. They also provide reference image features which are used to visually servo the part to the desired position.
1. Introduction
Over the past decade, considerable research has been performed on Robotic Visual Servoing (RVS) (see [1][2] for review and tutorial). Using real-time visual feedback, researchers have demonstrated that robotic systems can pick up moving parts, insert bolts, apply sealant, and guide vehicles. With the rapid improvements being made in computing, image processing hardware, and CCD cameras, the application of RVS techniques is now becoming widespread. Ideal applications for RVS are typically those which require extreme precision and cannot be performed cost effectively with fixturing. As the manufacturing lot size of the product increases, it is usually more cost effective to design a set of fixtures to hold the parts in the proper orientations. However, for small lot sizes and large numbers of diverse parts, vision becomes an essential sensor. Historically, computer vision has been used in a look-and-move mode where the vision system first locates the part in robot world coordinates, and then the robot moves "blindly" to that location and picks up the part. In the 1980s, computing and image processing hardware improved to the point where vision can now be used as a continual feedback sensor for controlling the relative position between the robot and the part. RVS is inherently more precise than look-and-move vision because an RVS error-driven control law improves the relative positioning¹
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.
accuracy even in the presence of modeling (robot, camera, or object) uncertainties. One ideal application for RVS which meets these requirements is the microassembly of MEMS (Micro-Electro-Mechanical Systems) components. In recent years, the world economy has seen expansive market growth in the area of MEMS. It is predicted that the MEMS market could reach more than $34 billion by the year 2002. Today, commercially available MEMS products include inkjet printer heads and accelerometers for airbags. These products require little or no assembly because a monolithic integrated circuit process is used to develop the devices. However, future MEMS products may not be so fortunate. Monolithic integration is not feasible when incompatible processes, complex geometry, or different materials are involved. For these cases, new and extremely precise micromanipulation capabilities will be required for successful product realization. Sandia National Laboratories is currently developing manufacturing processes to make MEMS parts with 10-100 micron outer dimensions and submicron tolerance for use in weapons surety devices. In particular, Sandia is pursuing both surface machined silicon and LIGA (Lithography, Galvanoforming, Abforming) parts. The surface machined silicon devices are fabricated in place using several layers of etched silicon and generally do not require assembly. However, the LIGA parts are batch fabricated and do require assembly. The LIGA parts are formed by using X-ray lithography to create molds in PMMA (polymethylmethacrylate) and then electroplating metals (typically nickel, permalloy, and copper) in the molds. Afterwards, the metal parts are released into Petri dishes. LIGA parts are of special interest because they can be made thicker than silicon parts (hundreds of microns versus tens of microns), they can be made of metals, which makes them stronger in tension than surface machined silicon, and they can contain iron, which allows them to be configured as miniature electromagnetic motors. The disadvantage of LIGA parts over silicon structures is that they must be assembled. The required precision, operator stress and eye strain associated with assembling such minute parts under a microscope generally preclude manual assembly from being a viable option. An automated assembly system addresses these problems. There are several reasons why RVS is ideally suited for the assembly of LIGA parts. First, from a physiological standpoint, human beings exclusively use their vision sense to assemble parts of this size. People do not use force feedback because they cannot feel micro-Newtons of force. Second, since the LIGA parts are randomly placed in dishes and it is difficult to design parts feeders and fixturing with submicron tolerances, vision is required to locate the parts. Third, the environment under a microscope is structured and the parts are well known. Fourth, most assembly operations are 4 degree-of-freedom (DOF) problems (x, y, z, and rotation about z). These last two points greatly simplify the image processing required to recognize and locate the parts.
In addition to the above points, this problem is well suited for a completely automated manufacturing process based on CAD information. The LIGA parts are originally designed using CAD packages such as AutoCAD, ProE, or Velum. The designs are then translated to GDSII, which is the format that the mask shops use to develop the X-ray masks for the LIGA process. Therefore, we already have CAD information on each part. Also, since X-rays are used to develop the LIGA molds, both the horizontal and vertical tolerances of the parts are quite precise (submicron horizontal tolerances, and 0.1 micron taper over 100 microns of thickness). Therefore, there is excellent correspondence between the CAD model and the actual part. If a synthetic microscope image of the part could be created, it would solve one very important RVS issue: where do the image reference features come from? The reference features could be learned through teach-by-showing of actual parts; however, this is not cost effective in an agile manufacturing environment. Instead, it would be best if the reference image features could be derived directly from the CAD model. In this way, the model could be used for assembly planning even before the parts are produced. Even with an accurate CAD model, there are several issues that make microassembly a difficult assembly planning problem. As discussed by others in the field [3][4], the relative importance of the interactive forces in microassembly is very different from that in the macro world. Gravity is almost negligible, while surface adhesion and electrostatic forces dominate. To some extent these problems can be reduced by clean parts and grounding surfaces. But the assembly plan should take these effects into account. To date, several different approaches to teleoperated micromanipulation have been attempted. Miyazaki [5] and Koyano [6] meticulously picked up 35 polymer particles (each 2 microns in diameter) and stacked them inside a scanning electron microscope (SEM). Mitsuishi [7] developed a teleoperated, force-reflecting micromachining system under a SEM. On a larger scale, Zesch [8] used a vacuum gripper to pick up 100 micron size diamond crystals and deposit them at arbitrary locations. Sulzmann [9] teleoperated a microrobot using 3D computer graphics (virtual reality) as the user interface. More recently, researchers have gone beyond teleoperation to use visual feedback to automatically guide microrobotic systems. Sulzmann [9] illuminated gallium phosphate patches on a microgripper with an ion beam, and he used the illuminated features to locate and position the microgripper. Vikramaditya [10] investigated using visual servoing and depth-from-defocus to bring parts into focus and to a specified position in the image plane. The estimation of depth from focus has also been addressed by several other researchers [11-14]. In this paper, we take the next step by creating synthetic images from CAD data. These images are used to test image processing algorithms off-line and to create reference image features which are used on-line for visual servoing. The next four sections describe the workcell, an optics model used to generate
synthetic images, resolved rate visual servoing equations, as well as ongoing CAD-Driven assembly planning work.
2. Workcell Description
Our microassembly workcell consists of a 4 DOF AMTI (subsidiary of ESEC) Cartesian assembly system, a 4 DOF precision stage, micro-tweezers, and a long working distance microscope (see Figure 1). The AMTI robot has a repeatability of 0.4 microns in the x and y directions, 8 microns in the z direction, and 23.56 arc-seconds in rotation about z. The precision stage has a repeatability of approximately 1 micron in x, y, and z, and 1.8 arc-seconds in rotation about z. The microscope is fixed above the stage and has a motor-driven zoom and focus.
Figure 1. Microassembly Workcell.
During assembly operations, the AMTI robot positions the micro-tweezers above the stage and within the field of view of the microscope. The precision stage is used to move the LIGA parts between the fingers of the tweezers. The tweezers are closed on the first part, the stage is lowered, and the mating part on the stage is brought into the field of view. The stage is then raised into position and the part in the tweezers is released. The micro-tweezers are LIGA fabricated [15] and are actuated by a linear ball-and-screw DC motor and a collet-style closing mechanism. The
current version of these tweezers is 20.8 mm long, 200 microns thick, and has two fingers which are 100 microns wide. A teleoperated interface was developed to test simple pick-and-place operations. The AMTI robot, the 4 DOF precision stage, the micro-tweezers, and the focus, magnification, and lighting of the microscope are controlled through a custom developed user interface built within an Adept A-series VME controller. The image of the parts as seen by the microscope is displayed on the computer screen. The x and y position of the robot and stage are controlled by the operator by dragging a cursor on the graphical display. Sliders are used to control the z position and theta orientation of the robot and stage as well as the microscope focus, magnification, and lighting.
Figure 2. LIGA tweezers placing a LIGA gear on a 44 micron OD shaft.
This teleoperated interface has been used to pick up and place 100 micron O.D. LIGA gears with 50 micron holes on pins ranging from 35 to 49 microns (see Figure 2). The next step is to automate the assembly using visual feedback. The next section describes the optical model used to evaluate the microscope's depth of field and generate synthetic images from CAD data.
3. Optics Modeling
When viewing parts on the order of 100 microns in dimension, it is important to have a precise model of the optics, including models of the field of view and depth of field. This model is even more important if the assembly is to be performed automatically from CAD information. What is needed is a way to
create a synthetic image before the part is even produced. Then we can design for assembly and determine off-line the required image processing routines for assembly. In this regard, Fourier optics methods can be used to create synthetic images from the CAD drawings. First, we provide a simple review of microscope optics. In our experiments, we are using a long working distance microscope by Navitar. This microscope uses an infinity-focused objective lens. Referring to Figure 3, the rays emanating from a given objective point are parallel between the objective and the tube lens. The tube lens is used to focus the parallel rays onto the image plane. The magnification is calculated by dividing the focal length of the tube lens by the focal length of the objective lens [16].
Figure 3. Infinity-corrected microscope optics.
$$ x' = m x, \qquad y' = m y, \qquad m = \frac{f_t}{f_o} \tag{1} $$
where (x, y) is the object position in the objective focal plane, (x', y') is the projected position in the image plane, m is the lateral magnification, f_t is the focal length of the tube lens, and f_o is the focal length of the objective. With our microscope, the focal length of the tube lens is adjustable so that the magnification varies from 6.9 to 44.5.
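Equation (1) is easy to apply directly; the short example below projects an object point into the image and reports the magnification. The tube and objective focal lengths are assumed values chosen only to reproduce the low end of the quoted magnification range, not the actual Navitar optics.

```python
def image_position(x_mm, y_mm, f_tube_mm, f_objective_mm):
    """Equation (1): x' = m x, y' = m y with lateral magnification m = f_t / f_o."""
    m = f_tube_mm / f_objective_mm
    return m * x_mm, m * y_mm, m

# A 50 micron offset on the object maps to 0.345 mm in the image at m = 6.9
# (focal lengths here are assumed, illustrative values).
print(image_position(0.05, 0.0, f_tube_mm=138.0, f_objective_mm=20.0))
```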
Figure 4. Geometric depth of field.
The depth of field can be determined by analyzing Figure 4. Here, the objective and tube lens are modeled as a single thick lens with magnification m. The in-focus object plane is denoted as B, and the corresponding in-focus image plane is denoted as B'. When the object is moved out of focus to planes A or C, a point on A or C is projected into a disk of diameter b_g on object plane B. The resulting disk in the image plane has diameter b_g'. By using similar triangles, the geometric depth of field is given by
$$ \Delta_g = \Delta_1 + \Delta_2 \approx \frac{n\, b_g'}{m A} \tag{2} $$
where n is the refractive index of the optics, and A is the numerical aperture of the optics [17]. This expression is valid if the object blur b_g on plane B is much less than the lens aperture radius a. Solving this equation for the defocused blur in the image,
$$ b_g' = \frac{2 m A |\Delta|}{n} \tag{3} $$
where Δ = Δ_1 ≈ Δ_2. In addition to the geometric depth of field, Fraunhofer diffraction is also important as the objects being viewed approach the wavelength of light. Rayleigh's criterion [18] says that the diameter of the Airy disk in the image plane is
$$ b_r' = \frac{1.22\, \lambda\, m}{A} \tag{4} $$
where λ is the wavelength of incident light. This is the diameter of the first zero crossing in an intensity image of a point source when viewed through an ideal circular lens with numerical aperture A and magnification m. Assuming linear optics, the geometric blur diameter and the Airy disk diameter are additive. Adding Equations (3) and (4) and solving for Δ, the total depth of field is given by
$$ \Delta_T = 2\Delta = \frac{n\, b'}{m A} - \frac{1.22\, \lambda\, n}{A^2} \tag{5} $$
where b' = b_g' + b_r' is the acceptable blur in the image. The first term is due to geometric optics, while the second term is due to diffraction from Rayleigh's
criterion. Since Equation (5) must always be positive, the acceptable geometric blur must be larger than the Airy disk. Note that even when the object is in perfect focus (Δ_T = 0), there is still a small amount of blurring due to diffraction. For example, the parameters for the microscope used in the experiments are n = 1.5, λ = 0.6 microns, A = 0.42, and m = 6.9. The resulting image blur due to diffraction (Airy disk diameter) is 12.026 microns. If the acceptable image blur is 12.45 microns (approximately 1 pixel on a 1/3 inch format CCD), then Δ_T = 0.22 microns. Therefore, two points separated by b'/m, or 1.8 microns, will become indistinguishable if the points are moved as little as 0.11 microns out of the focal plane!
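The numbers above follow directly from equations (4) and (5); the small helper below reproduces them and can be reused to check other configurations.

```python
def airy_diameter_um(wavelength_um, m, A):
    """Diffraction blur (Airy disk) diameter in the image plane, equation (4)."""
    return 1.22 * wavelength_um * m / A

def total_depth_of_field_um(b_prime_um, n, m, A, wavelength_um):
    """Total depth of field, equation (5): geometric term minus diffraction term."""
    return n * b_prime_um / (m * A) - 1.22 * wavelength_um * n / A**2

# The worked example above: n = 1.5, lambda = 0.6 um, A = 0.42, m = 6.9.
print(airy_diameter_um(0.6, 6.9, 0.42))                      # ~12.03 um
print(total_depth_of_field_um(12.45, 1.5, 6.9, 0.42, 0.6))   # ~0.22 um
```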
The next problem is how to generate synthetic images which account for both the geometric depth of field and Fraunhofer diffraction. Using Fourier optics [18], a stationary linear optical system with incoherent lighting is described in terms of a 2D convolution integral:
$$ I_{im}(x', y') = \iint I_{obj}(x, y)\, S(x - x',\, y - y')\, dx\, dy \tag{6} $$
where I_im(x', y') is the image intensity, I_obj(x, y) is the object intensity, and
S(x,y) is the impulse response or point spread function. This convolution is more efficiently computed using Fourier Transforms:
$$ \tilde{I}_{im}(u, v) = \tilde{I}_{obj}(u, v)\, \tilde{S}(u, v) \tag{7} $$
where the tilde represents the 2D Fourier Transform of the function, and u and v are spatial frequencies in the x and y directions. Considering only the geometric depth of field, the impulse response is
$$ S_g(r', \theta') = \begin{cases} \dfrac{4}{\pi (b_g')^2}, & r' \le b_g'/2 \\[4pt] 0, & r' > b_g'/2 \end{cases} \tag{8} $$
where r' is the radial distance from the impulse location in the image plane. The impulse response is radially symmetric about the impulse location and independent of θ'. This implies that a geometrically defocused image is the focused image convolved with a filled circle of diameter b_g'. Considering only Fraunhofer diffraction, the impulse response is
$$ S_d(r', \theta') = \left(\frac{\pi a^2}{\lambda f m}\right)^{\!2} \left[\frac{2 J_1\!\left(\dfrac{2\pi a r'}{\lambda f m}\right)}{\dfrac{2\pi a r'}{\lambda f m}}\right]^2 \tag{9} $$
where J_1(·) is the first-order Bessel function, a is the aperture radius, λ is the wavelength of light, and f is the focal length of the lens. This function is also radially symmetric about the impulse location and independent of θ'. In addition, it is the expression used to generate the Airy disk. It would be computationally expensive to convolve this expression with the original image without the use of Fourier Transforms. Fortunately, there exists a simple expression in the Fourier domain. With incoherent light, the Fourier Transform of the impulse response is given by the autocorrelation of the aperture (pupil) function with its complex conjugate:
$$ \tilde{S}_d(u, v) = \iint P^*(x, y)\, P(x + \lambda z' u,\; y + \lambda z' v)\, dx\, dy \tag{10} $$
For a circular aperture of radius a, the pupil function is
$$ P(r_{uv}, \theta) = \begin{cases} 1, & r_{uv} \le a \\ 0, & r_{uv} > a \end{cases} \tag{11} $$
and the resulting transfer function is given by
$$ \tilde{S}_d(r_{uv}) = \frac{2}{\pi}\left[\cos^{-1}\!\left(\frac{\lambda z' r_{uv}}{2a}\right) - \frac{\lambda z' r_{uv}}{2a}\sqrt{1 - \left(\frac{\lambda z' r_{uv}}{2a}\right)^{2}}\right] \tag{12} $$
where r_uv is the radius in image frequency. The combined impulse response of both the geometric depth of field and the Fraunhofer diffraction is given by the convolution of S_g and S_d or, in the frequency domain, the product of their Fourier transforms. It should be noted that both S_g and S_d act as low-pass filters on the image. S_g becomes the more dominant filter as the object is moved out of the focal plane. A block diagram of the entire synthetic image generation process is given in Figure 5.
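A compact NumPy sketch of this pipeline is shown below. It applies equation (7) by multiplying the object spectrum with a geometric-defocus OTF (the transform of the filled circle of equation (8)) and a diffraction OTF computed as the pupil autocorrelation of equation (10). For brevity the blur diameter and pupil radius are taken directly in pixels, so the calibration from microns through m, λ and f that the real simulator performs is left out; this is an illustrative sketch, not the Datacube implementation.

```python
import numpy as np

def disk(shape, radius_pix):
    """Binary filled circle of the given radius, centered in an (h, w) array."""
    h, w = shape
    yy, xx = np.mgrid[:h, :w]
    return (((yy - h / 2.0) ** 2 + (xx - w / 2.0) ** 2) <= radius_pix ** 2).astype(float)

def synthetic_image(obj, bg_diam_pix, pupil_radius_pix):
    """Blur a binary object image with both transfer functions (equation (7)).
    bg_diam_pix is the geometric blur diameter bg' and pupil_radius_pix the
    aperture radius, both expressed in pixels (an assumed, uncalibrated scale)."""
    shape = obj.shape
    # Geometric defocus: OTF = transform of a filled circle of diameter bg'.
    psf_g = disk(shape, max(bg_diam_pix / 2.0, 0.5))
    otf_g = np.fft.fft2(np.fft.ifftshift(psf_g / psf_g.sum()))
    # Diffraction: OTF = autocorrelation of the circular pupil; the zero-lag
    # (DC) term sits at index [0, 0] and is normalized to one.
    pupil = disk(shape, pupil_radius_pix)
    autocorr = np.real(np.fft.ifft2(np.abs(np.fft.fft2(pupil)) ** 2))
    otf_d = autocorr / autocorr[0, 0]
    # Equation (7): multiply the object spectrum by both transfer functions.
    return np.real(np.fft.ifft2(np.fft.fft2(obj) * otf_g * otf_d))

# Example: blur a filled disc that stands in for the 100 micron gear.
gear = disk((256, 256), 50)
blurred = synthetic_image(gear, bg_diam_pix=12.0, pupil_radius_pix=20.0)
```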
Figure 5. Block diagram of synthetic image generation.
Synthetically generated examples of Fraunhofer diffraction and geometric blur are shown in Figures 6-8. Figure 6 shows the geometric and out-of-focus images of a 100 micron gear. Figure 7 shows a cross section of the geometric image and the same image after Fraunhofer diffraction. Notice how the edges of the gear are rounded. Figure 8 shows a cross section of the image which is out of focus. The geometric blur becomes the dominant effect as the out-of-focus distance increases.
Figure 6. Synthetically generated images. The image on the left is in-focus. The image on the right is a diffracted image which is 25 microns out of depth of field.
Figure 7. Cross section of geometric in-focus image (vertical lines) and diffracted in-focus image.
Figure 8. Cross section of a diffracted image which is 25 microns out of depth of field. The cross section due only to geometric blurring is the curve which starts at zero and has peaks at 225. When diffraction is included, the image does not start at zero and the peaks are attenuated.
These results can be compared to real images of a 100 micron gear under a microscope as shown in Figures 9 and 10. Figure 9 shows an image of the gear when in-focus, and when moved out-of-focus by 30 microns. Figure 10 shows a cross section of the in-focus and out-of-focus images. Notice that the edges of the in-focus image are rounded. Also, notice that the intensity of the out-of-focus
image is attenuated and the slope is more gradual than the in-focus image. These results were predicted by the synthetic image (see Figures 6-8). However, the comparison also highlights some effects which are not yet modeled. In particular, the through-the-lens lighting is not uniform, and there are shadowing effects inside the gear. Also, the above analysis is only valid for parts which are all in the same z plane. Nevertheless, we can use this synthetic image to derive reference features for visual servoing, as will be shown next.
Figure 9. Real experimental images. The image on the left is in-focus. The image on the right is an image which is 30 microns out of depth of field.
Figure 10. Cross section of in-focus and out-of-focus images in Figure 9.
4. Off-line Assembly Planning
In this section, we show how a synthetic image can be used to test image processing routines and to generate reference image features for control. Much of our work has concentrated on developing an optics simulator and off-line image processing extractor which is used to generate an augmented assembly plan. In Figure 11, the bold boxes represent computer programs which process the data files in the remaining boxes. The off-line system reads in the task plan from one file and the boundary representation of the CAD part from the ".dxf" file. A synthetic image is generated using the Fourier optics model, from which a variety of image processing routines are tested and image features are automatically selected for control [19].
Figure 11. Block Diagram of CAD to Assembly Process.
To date, we have successfully tested using a synthetic image to visually servo a LIGA gear to a desired x,y position. Figures 12-14 show a sequence of images as the gear on the stage is visually servoed to the reference image position. Figure 12 shows the synthetic image which was generated from the CAD information. The part was recognized and located by finding the center of the part and then searching for 18 gear teeth on the outer diameter and a notch on the inner diameter. Its location in the image serves as the reference position. Figure 13 shows a real part as seen by the microscope and the application of the same image processing routines to locate the gear. Next, the part is visually servoed to the reference position at 30 Hz using the x,y centroid of the gear. Figure 14 shows the final position of the gear after visual servoing. Currently, the repeatability of the visual servoing is 1 micron in the x and y directions.
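The servo loop itself reduces to driving the measured centroid to the reference centroid taken from the synthetic image. The sketch below is a simplified, proportional version of the resolved rate control of [19]; the scalar pixels-per-micron calibration, the gain, and the helper names measure_gear_centroid, grab_image and move_stage are assumptions standing in for the real image processing and stage interface.

```python
import numpy as np

def servo_step(ref_centroid_px, cur_centroid_px, pixels_per_micron, gain=0.5):
    """One visual servoing update: convert the image-space centroid error into
    a stage correction (dx, dy) in microns using a scalar calibration."""
    err_px = np.asarray(ref_centroid_px, float) - np.asarray(cur_centroid_px, float)
    return gain * err_px / pixels_per_micron

# At the 30 Hz rate quoted above (helper names are hypothetical):
#   while True:
#       cur = measure_gear_centroid(grab_image())
#       dx, dy = servo_step(ref_centroid, cur, pixels_per_micron=2.0)
#       move_stage(dx, dy)
```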
5. Conclusion
This paper described a prototype micromanipulation workcell for assembly of LIGA parts. We have demonstrated the ability to visually servo the LIGA parts to a desired x,y position between the tweezers. Fourier optics methods were used to generate a synthetic image from a CAD model. This synthetic image was used to select image processing routines and generate reference features for visual servoing. In the near future, we plan to generate a sequence of synthetic images which represent assembly steps, e.g., tweezers grasp gear, locate shaft, and put gear on shaft. Again, these images will be used to select image processing routines and generate reference features for visual servoing.
Figure 12. Synthetic reference image.
Figure 13. Initial location of real gear.
Figure 14. Final position after visual servoing. Acknowledgement. A previous version of this paper appears in the Proceedings of the 1998 International Conference on Robotics and Automation.
References
[1] P. I. Corke, "Visual Control of Robot Manipulators - A Review," Visual Servoing: Real-Time Control of Robot Manipulators Based on Visual Sensory Feedback, Ed. K. Hashimoto, World Scientific Publishing Co. Pte. Ltd., Singapore, 1993.
[2] S. Hutchinson, G. D. Hager, P. I. Corke, "A Tutorial on Visual Servo Control," IEEE Trans. on Robotics and Automation, Vol. 12, No. 5, pp. 651-670, Oct. 1996.
[3] R. Arai, D. Ando, T. Fukuda, Y. Nonoda, T. Oota, "Micro Manipulation Based on Micro Physics - Strategy Based on Attractive Force Reduction and Stress Measurement," Proc. of ICRA 1995, pp. 236-241.
[4] R. S. Fearing, "Survey of Sticking Effects for Micro Parts Handling," Proc. of IROS '95, Pittsburgh, PA, August 1995, Vol. 2, pp. 212-217.
[5] H. Miyazaki, T. Sato, "Pick and Place Shape Forming of Three-Dimensional Micro Structures from Fine Particles," Proc. of ICRA 1996, pp. 2535-2540.
[6] K. Koyano, T. Sato, "Micro Object Handling System with Concentrated Visual Fields and New Handling Skills," Proc. of ICRA 1996, pp. 2541-2548.
[7] M. Mitsuishi, N. Sugita, T. Nagao, Y. Hatamura, "A Tele-Micro Machining System with Operation Environment Transmission under a Stereo-SEM," Proc. of ICRA 1996, pp. 2194-2201.
[8] W. Zesch, M. Brunner, A. Weber, "Vacuum Tool for Handling Microobjects with a Nanorobot," Proc. of ICRA 1997, pp. 1761-1766.
[9] A. Sulzmann, H.-M. Breguett, J. Jacot, "Microvision System (MVS): a 3D Computer Graphic-Based Microrobot Telemanipulation and Position Feedback by Vision," Proc. of SPIE Vol. 2593, Philadelphia, Oct. 25, 1995.
[10] B. Vikramaditya and B. J. Nelson, "Visually Guided Microassembly Using Optical Microscope and Active Vision Techniques," Proc. of ICRA 1997, pp. 3172-3178.
[11] A. P. Pentland, "A New Sense of Depth of Field," IEEE Trans. on PAMI, Vol. PAMI-9, No. 4, July 1987.
[12] J. Ens and P. Lawrence, "An Investigation of Methods for Determining Depth of Focus," IEEE Trans. on PAMI, Vol. 15, No. 2, February 1993.
[13] S. K. Nayar and Y. Nakagawa, "Shape from Focus," IEEE Trans. on PAMI, Vol. 16, No. 8, August 1994.
[14] M. Subbarao and T. Choi, "Accurate Recovery of Three-Dimensional Shape from Image Focus," IEEE Trans. on PAMI, Vol. 17, No. 3, March 1995.
[15] J. Feddema, M. Polosky, T. Christenson, B. Spletzer, R. Simon, "Micro-Grippers for Assembly of LIGA Parts," Proc. of World Automation Congress '98, Anchorage, Alaska, May 10-14, 1998.
[16] M. Bass, Handbook of Optics, 2nd Edition, Vol. II, pp. 17.1-17.52, McGraw-Hill, 1995.
[17] L. C. Martin, The Theory of the Microscope, American Elsevier Publishing Company, 1966.
[18] G. O. Reynolds, J. B. DeVelis, G. B. Parret, B. J. Thompson, The New Physical Optics Notebook: Tutorials in Fourier Optics, SPIE, 1989.
[19] J. T. Feddema, C. S. G. Lee, and O. R. Mitchell, "Weighted Selection of Image Features for Resolved Rate Visual Feedback Control," IEEE Trans. on Robotics and Automation, Vol. 7, pp. 31-47, Feb. 1991.
The Block Island Workshop: Summary Report
Gregory D. Hager, David J. Kriegman, and A. Stephen Morse
with contributions from P. Allen, D. Forsyth, S. Hutchinson, J. Little, N. Harris McClamroch, A. Sanderson, and S. Skaar
Center for Computational Vision and Control, Departments of Computer Science and Electrical Engineering, Yale University, New Haven, CT 06520 USA
Summary. In the opening chapter, we outlined several areas of research in vision and control, and posed several related questions for discussion. During the Block Island workshop on Vision and Control, these and many related ideas were debated in the formal presentations and discussion periods. The summary reports of these discussions, which can be viewed at http://cvc.yale.edu/bi-workshop.html, portray a rich and diverse collection of views and opinions related to topics surrounding vision and control. In this chapter, we attempt to broadly summarize some points of agreement from the discussions at the workshop and to codify the current state of the art and open questions at the confluence of vision and control. Although this document is largely derived from the discussions at the workshop, in the interests of brevity and focus we have chosen to emphasize and amplify specific opinions and ideas. We therefore urge readers to also consult the individual group summaries at the URL as a complement to this document.
1. What Does Vision Have to Offer Control?
Given that the topic of control centers on the issues of modelling and design of systems, it is unsurprising that the general conclusion of the discussions was that vision offers control a set of challenging, often slightly unusual real-world problems. In this sense, vision can be viewed as both a testbed of control ideas and a potential driver of new directions in the field of control. In particular, common threads throughout the discussions included the fact that vision systems offer a "global" view of the world, that vision as a sensor is highly complex, and that vision-based systems have novel aspects which may lead to the development of new control architectures and related concepts. Finally, although control has historically had a large applications component, it is viewed as an area where theory is increasingly divorced from practice. Vision promises to be an area where close interaction between (new) theory and practice is possible, thus invigorating the field with fresh problems and approaches.
1.1 Architectural Issues
As noted in many of the individual group summaries, vision has the unique property of being a dense spatial (two-dimensional) sensor, in contrast to many of the "classical" sensing technologies used in control systems. As such, it injects novel properties into the control problem. For example, using vision it is possible to observe simultaneously (in the same image) a controlled system and a target toward which the system is to be driven. As a result, provided the loop is stable, accurate positioning can be achieved in the face of calibration error without the use of integrators in the feedback loop. The study and characterization of control architectures making use of this feature of visual sensing remains an open problem. Furthermore, images are high-dimensional objects whose dimensionality must be reduced through various types of focusing, filtering, and so forth in order to be used in control systems. The appropriateness of a particular projection of an image depends greatly on the task and the current operating conditions of the system. Thus, implicit in the design of vision-based control is the design of an intelligent observer which can choose the most appropriate method of image processing for optimum task performance. Images themselves, as they are two-dimensional projections of a three-dimensional world, are inherently incomplete. Thus, in addition to questions of image processing and estimation, there are surrounding issues related to choosing a desirable or optimal configuration for the observing system, again relative to an associated task. It is worth noting that the design of such "active observers" is a problem which has long been of interest in vision, and hence this is an area where control could have great impact.
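As a concrete illustration of this calibration insensitivity, the following minimal sketch (in Python, purely for exposition; the feature trackers, gain, and interaction-matrix estimate are hypothetical and not drawn from any chapter in this volume) implements a standard endpoint closed-loop servo law in which the end-effector and the target are measured in the same image. Because the error is defined directly between two measured feature vectors, the commanded motion vanishes only when the image error itself vanishes; errors in the estimated interaction matrix distort the transient but, provided the loop remains stable, do not bias the final position, which is why no integrator is needed.

import numpy as np

def ibvs_step(s_effector, s_target, L_hat, gain=0.5):
    """One step of an endpoint closed-loop, image-based servo law.

    s_effector : measured image features of the controlled end-effector
    s_target   : measured image features of the target, from the same image
    L_hat      : rough estimate of the interaction (image Jacobian) matrix
    Returns a velocity command for the camera or end-effector.
    """
    # The error lives entirely in the image; no 3-D reconstruction is needed.
    e = s_effector - s_target
    # Classical proportional law: v = -gain * pinv(L_hat) @ e.
    return -gain * np.linalg.pinv(L_hat) @ e

# Hypothetical usage with trackers supplying 8-dimensional feature vectors
# (four image points): v = ibvs_step(track_effector(img), track_target(img), L_hat)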
1.2 Vision is a "non-local" sensor
The fact that vision is a spatial sensor implies, in the words of Mark Spong, that you can "see where you are going," echoing discussion in the summaries written by S. Hutchinson and D. Forsyth. For example, much of the work in path planning, which assumes a global map of the environment, can be mapped into the vision domain. In dynamic environments, vision lets a system detect and predict future interactions with other, possibly moving, objects. Thus, the development of a control policy takes on a distinctly "non-local" character in both the spatial and time domains. The implication is that geometry, both of the sensor and of the surrounding environment, will come to play a larger role in the development of control algorithms. Combining classical notions of dynamics, stability, and robustness with non-local, time-evolving geometric information remains largely unexplored in control.
1.3 Vision Implies Complexity
Control, particularly the formal aspects, makes strong use of modeling assumptions about the world and very specific definitions of task. As noted above and as discussed in several of the summaries, one of the fundamental issues in vision is the complexity of the interaction between the sensor, the environment, and the controlled system. For example, the fact that vision offers, at best, a so-called "2-1/2D" view of the world implies that, in general, either tasks or environments must be arranged so that the information needed to perform a task is available when it is needed. Classical notions of controllability and observability do not deal adequately with the issues surrounding the task-directed acquisition of information. One of the challenges vision offers control is to tame the apparent complexity of the vision and control problem through the use of modeling, formal task specification, and analytical methods which are not well known in the vision community. In particular, ideas and concepts developed in the areas of switching theory, nonlinear estimation, and hybrid systems seem to map well onto this problem area. Hence, vision may provide a rich yet concrete realization of problems where these concepts can be applied, tested, and further developed in both theory and practice.
2. What Does Control Have to Offer Vision?
As noted above, control, being fundamentally a study of modeling, design and analysis, offers the promise of new approaches, analytical techniques, and formal tools to the problem of designing vision-based systems. More specifically, control theory has very well-developed notions of problem components such as "task," "plant," and "observer," and similarly formal notions of "stability," "observability," "robustness to uncertainty," and so forth (see particularly the discussion summary of N.H. McClamroch). At a more conceptual level, control brings a slightly different focus to the vision problem. "Pure" vision research deals largely with the extraction of information or interpretation of a visual signal, and has historically focused on the static spatial rather than dynamical properties of imagery. Control brings the perspective of transforming signal to action irrespective of any "intermediate" representation. It directly incorporates the notion of motion and dynamics, and it suggests a focus on temporal as well as spatial aspects of vision as a sensor. On a more pragmatic level, the control perspective tends to focus on specific tasks and what it takes to solve them. In fact, one of the most compelling reasons to combine vision and control is that control using vision forces us to define a doable, physically realizable, and testable subset of the vision problem. In most cases, control problems can be formulated without reference to
explicit geometric reconstruction, recognition, or, in some cases, correspondence, all of which are difficult or unsolved problems in the vision domain. Testability arises from the fact that vision in a feedback loop is used to move physical mechanisms, and hence "hard" metrics such as accuracy, speed, and robustness can be directly measured and evaluated. Concrete realizations, in addition to demonstrating the possibilities of vision and control, tend to inject excitement and interest into the broader vision community. Finally, another impact of combining vision and control is to expand the domain of discourse of vision-based systems beyond vision itself. It is generally well accepted both within and outside the vision community that, when used to control motion, vision must be (or at least should be) augmented with other sensor modalities. For example, grasping an object involves long-distance motion before a manipulator comes into contact with or transports an object; once in contact, however, force or tactile sensing comes into play. Vision-based navigation or driving uses vision to keep the vehicle "pointed down the road," but typically relies on inertial sensing or odometry to provide loop stability. Control has a long history of using multiple sensing modalities to accomplish its objectives and thus can contribute to a broader perspective on the role vision plays in animate systems.
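One common way to realize this hand-over between modalities is a guarded move, in which vision drives the gross approach and control switches to a force (or tactile) loop once contact is detected. The sketch below is a generic illustration only; the thresholds, gains, and sensor interfaces are assumptions rather than the design of any system described in this volume.

import numpy as np

def guarded_approach(visual_error, measured_force,
                     contact_threshold=2.0, desired_force=5.0,
                     kp_vision=0.5, kp_force=0.01):
    """Choose the active control law for a vision-then-force approach.

    visual_error   : image-space error toward the contact location
    measured_force : contact force magnitude from a wrist sensor (N)
    Returns (mode, command).
    """
    if measured_force < contact_threshold:
        # Free space: vision closes the outer loop and drives the approach.
        return "VISION", -kp_vision * np.asarray(visual_error)
    # Contact: vision cannot resolve the interaction forces, so regulate
    # force about a set point instead (a simple proportional correction).
    return "FORCE", kp_force * (desired_force - measured_force)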
3. What are the Successes and Challenges in Vision and Control?
Combining vision and control is a topic which has been addressed in a variety of contexts for nearly three decades, and we are beginning to see the fruits of these labors in practical domains. These initial successes, in addition to providing useful benchmarks of "where we are," also point the way toward future applications and delineate a set of challenges that define the forefront of the field. As discussed by many of the groups (see the summaries of P. Allen, A. Sanderson and S. Hutchinson), many participants were interested in defining "benchmark" problems which would drive the field forward as well as demonstrate the unique capabilities of vision-based control and control-based vision, that is, of vision and control. In this section, we briefly summarize what the participants agreed were the successes, the likely future applications, and the open problems of the field.
3.1 A Success: Automated Driving
Nearly everyone agreed that the most visible and exciting success story in the area of vision-based control is automated driving. Since Prof. E. Dickmanns' ground-breaking demonstrations in the late 80's, several groups around the world have successfully fielded systems capable of navigating modern highways at or near typical highway speeds. These systems are rapidly moving
from laboratory demonstration to true prototypes in collaboration with many of the world's automobile manufacturers. During the various presentations and discussions, Professor Dickmanns and other participants in the area of automated driving elucidated many of the lessons this application has for vision and control. These included the following points:
- Strongly task-directed vision: Successful vision-based driving systems focus on using vision to provide only the information essential to the control task. Conversely, the fact that vision is used in control provides for continuity and predictability of the visual "signal," which simplifies the associated image processing.
- Redundancy and layering: Vision is inherently a sensor prone to intermittent failure. Occlusion, changes in lighting, momentary disruptions of features in a visual image, and other unmodeled effects mean that any system must be able to somehow cope with missing or occasionally faulty information. The only means for doing so is to provide "fallback" measures which can be used to bridge these gaps, or to bring the system to a "safe" state until nominal operating conditions return (a minimal sketch of such a fallback scheme appears after this list).
- Testing and engineering: Prof. Dickmanns was lauded for his group's ability to concentrate on this problem area, and to generate and test several systems. The complexity of the vision and control problem suggests that only through concentrated, long-term engineering and testing can we better understand system and computational architectures for this problem domain.
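The following sketch illustrates the kind of layered fallback logic referred to under "Redundancy and layering" above. It is a generic illustration rather than the logic of any of the driving systems discussed; the mode names, frame counts, and interfaces are assumptions.

from enum import Enum

class Mode(Enum):
    TRACK = 1   # nominal: visual measurements arrive and agree with prediction
    COAST = 2   # measurement lost: propagate the prediction for a short time
    SAFE = 3    # too long without data: bring the system to a safe state

def update_mode(measurement_ok, frames_without_data, max_coast=15):
    """Layered fallback logic for an intermittently failing visual sensor."""
    if measurement_ok:
        return Mode.TRACK, 0
    frames_without_data += 1
    if frames_without_data <= max_coast:
        # Bridge short gaps (occlusion, lighting change) using prediction only.
        return Mode.COAST, frames_without_data
    # The gap is too long to bridge; fall back to a safe behaviour, e.g.
    # slow down and hold the lane using inertial or odometric sensing.
    return Mode.SAFE, frames_without_data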
3.2 Toward Future Successes: Vision-Based Control in Structured Domains
One of the lessons of automated driving is, unsurprisingly, to choose a domain where the vision problem can be suitably restricted and solved. One of the popular topics during the workshop was speculation about what the most likely successful area for applications of vision-based motion control would be, or what good challenge problems would be. Here we list a few of the possibilities based on these discussions and the presentations at the workshop.
Fiducial-Based Visual Servoing: There are many areas of manufacturing where it is possible to build an appropriate fiducial for vision-based assembly or manipulation directly into production of the part or unit. In this case, the vision problem is largely solved through artificial markers. For example, F. Chaumette illustrated some of these issues in the assembly of an electrical connector using appropriately constructed fiducials. Furthermore, these markers can be designed with the view constraints of the assembly system in mind, and so many of the complexities associated with vision do not arise.
However, this introduces a new geometric planning problem during the design process.
Manipulation at a Microscopic Level: The chapter by John Feddema discusses a novel application of hand-eye coordination methods to the problem of constructing micro-machines. These machines, which can only be seen under a microscope, are so small that position control or force control is out of the question. Optical sensing therefore provides one of the few possibilities for manipulating these devices. Similar problems arise in diverse domains such as micro-biology (cell measurement or implantation), medicine (for example, various types of eye surgery), and inspection. The microscopic world has several novel attributes. For example, in a typical vision system focus and depth of field are less important than field of view and occlusion; at microscopic scales the case is nearly the opposite. Disturbances in the "macro" world typically stem from exogenous forces, vibration, and so forth, whereas in the microscopic world simple heating and cooling of the manipulator can inject large inaccuracies into the system. Hence, novel visual tracking and servoing techniques (for example, focus-based positioning, sketched at the end of this section) may be necessary to perform microscopic visual tasks.
Aids for the Handicapped: As illustrated in the chapter by Carceroni et al., another area where vision may play a role is in aids for the handicapped or elderly. For example, consider building a "smart" wheelchair which can move about a house, enabling its occupant to carry large or unwieldy objects that would otherwise make it impossible to move about with a joystick. Or, perhaps, large stores could offer transport to elderly or disabled persons who would otherwise have difficulty moving about. The principal advantage of this problem area is that, like driving, the situations are relatively controlled; in particular, a home is a nearly static environment with well-defined traffic areas and therefore could, in principle, be "mapped." Going one step further, as suggested in the summary of Art Sanderson, combining mobility and manipulation is a closely related area that will at once greatly improve the functionality of mobile systems while making use of existing results in hand-eye systems. The chapter by Tsakiris et al. develops several interesting points in this direction.
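To make the idea of focus-based positioning concrete, the sketch below computes a standard image sharpness measure (the variance of a discrete Laplacian) and scans the stage along the optical axis for its peak, a simple form of depth from focus. This is a generic illustration and not the method of the Feddema chapter; the image-acquisition and stage-motion functions are hypothetical.

import numpy as np

def sharpness(image):
    """Variance of a discrete Laplacian: larger when the image is in focus."""
    img = image.astype(float)
    lap = (-4.0 * img
           + np.roll(img, 1, axis=0) + np.roll(img, -1, axis=0)
           + np.roll(img, 1, axis=1) + np.roll(img, -1, axis=1))
    return lap.var()

def focus_scan(grab_image, move_stage_to, z_values):
    """Coarse depth from focus: return the stage height giving maximal sharpness.

    grab_image    : hypothetical function returning the current camera frame
    move_stage_to : hypothetical function commanding the stage height
    z_values      : candidate heights along the optical axis
    """
    scores = []
    for z in z_values:
        move_stage_to(z)
        scores.append(sharpness(grab_image()))
    return z_values[int(np.argmax(scores))]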
4. What are the Open Challenges?
A great deal of the discussion at the workshop revolved around identifying the key roadblocks to further advances in the theory and practice of vision-based control. The following sections briefly delineate some of the general conclusions of the participants as to what research topics are likely to have the most impact in moving the field forward.
Robust Vision: There was a general consensus among the workshop participants that the greatest current hurdle to progress in vision-based control was the robustness of most vision processes. In general, the basic control concepts of vision-based motion control are by now well understood. Likewise, technology has advanced to the point that implementing vision-based systems is no longer a significant hurdle in practice. Why, then, is vision-based control not in wider use in those areas where it confers obvious advantages, e.g. hazardous environments, flexible automation, and so forth? A clear answer is the fact that most laboratory demonstrations do not provide convincing evidence that vision-based control is robust, reliable, and practical in real environments. One clear difference (there are many) is that most laboratory demonstrations choose the task and environment to make the vision problem easily solvable: objects are black, the background is white, no object occlusion occurs, and repeatability of performance is not considered. The real world is not so kind, as recent results in vision-based driving have illustrated. Hence, developing "fail-safe" vision systems should be a top research priority in vision and control.
Programming and Integration: The practicality of vision-based control depends not only on the effectiveness of a single system, but also on the effectiveness of system construction. Thus, the ability to quickly integrate vision and control into a larger system is a crucial issue, and one which has also received little attention in the community. For example, vision is a spatial sensor, good for moving objects into contact; however, at the point of contact, force or tactile-based control must take over. How can this be accomplished? More generally, many tasks, particularly those in unstructured environments, involve a series of complex object manipulations. While the concepts of programming a manipulator have been well understood for at least three decades, equivalent programming concepts for vision-based manipulation are still completely lacking. In addition to programming concepts, the integration of vision and control in a larger task involves developing practical methods for combining discrete and continuous components, hopefully supported by the appropriate theoretical analysis of these methods.
Performance: With a few notable exceptions, including juggling, air hockey, ping-pong and driving, most vision-based control systems perform point-to-point motion tasks to static targets. The underlying models of the robot and the vision system are, at best, first order. Hence, stability and reliability depend on slow robot motion, thereby avoiding dynamical effects and minimizing the effect of time delay on stability. The chapter by Corke admirably illustrates this point. To be practical, vision-based control systems must be able to routinely move quickly and precisely. This implies that vision and control must not only work quickly, but that they must be well coordinated. A fast transport
to an approximately defined intermediate location may be performed "open-loop," with vision acting only as a monitor, whereas, as an object comes close to contact, control must switch to a more precise, closed-loop mode of operation. Developing the appropriate continuous controllers and/or discrete switching logics for such operations remains an open problem.
Learning and Vision: As discussed above, one of the fundamental limiting factors in vision and control is the richness and attendant complexity of vision as a sensing modality. One way of cutting the Gordian knot of complexity in visual images is to let the system train itself for a particular task. Several of the discussion groups alluded to this idea. In particular, repetitive, dynamical tasks would clearly lend themselves to this approach (a minimal sketch of one such scheme follows).
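As one concrete instance of a system training itself on a repetitive task, the sketch below applies a simple (P-type) iterative learning update to a feedforward command over repeated trials of the same motion; in practice a vision system would supply the trial-by-trial tracking error. This is a generic, scalar illustration under assumed plant parameters, not a method proposed at the workshop.

import numpy as np

def run_trial(u, a=0.8, b=0.5):
    """Simulate one trial of a first-order plant y[t+1] = a*y[t] + b*u[t]."""
    y = np.zeros(len(u) + 1)
    for t in range(len(u)):
        y[t + 1] = a * y[t] + b * u[t]
    return y[1:]

def iterative_learning(reference, trials=20, gamma=1.0):
    """P-type iterative learning: u_{k+1}[t] = u_k[t] + gamma * e_k[t]."""
    u = np.zeros(len(reference))
    for _ in range(trials):
        e = reference - run_trial(u)   # tracking error over the whole trial
        u = u + gamma * e              # improve the feedforward for the next trial
    return u, float(np.max(np.abs(e)))

# Example: after repeated trials the residual error on a ramp reference shrinks.
u_learned, last_error = iterative_learning(np.linspace(0.0, 1.0, 50))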
5. Summary and Conclusions
After three days of presentations and discussions, one of the questions we returned to during the final discussion session was "What defines or is novel about the area of vision and control?" To put it another way, is there something unique about vision and control which delimits it as a problem area, and is there something to this problem area which is more than "just control applied to vision"? Based on the preceding and the summary discussions, it would seem that one can confidently answer yes to both questions, although with some caveats. Briefly, it would seem that some of the reasons for considering a focus on vision and control in its own right are:
- Impact: It is clear that there is a plethora of problems, many of which would have immediate impact on productivity, safety, or quality of life, that would be partially or completely solved given the availability of cheap, reliable, and dependable vision-based control technology.
- Doability: Laboratory demonstrations of vision-based control systems exist, systems which in suitably restricted circumstances can be applied to real-world problems. Given the constant advance of computer technology along with an increased understanding of the scientific issues, these systems will themselves become more powerful and applicable to a wider range of problems. This ability to build incrementally on existing successes makes vision and control an ideal area for constant advancement in the state of the art.
- Concreteness: As argued above, one of the advantages of combining vision and control is the ability to evaluate in a rigorous manner the performance of the resulting system. This stands apart from many areas of vision research (e.g. shape representation, perceptual organization, or edge
detection) where the generality and complexity of the problem have defied the development of such rigorous metrics.
- Potential for Fundamental Theoretical Developments: A common thread in much of the discussion was the fact that, while existing vision and control systems clearly involve an often complex control logic, a theoretical framework for the development and/or analysis of that logic together with the lower-level continuous control loops is lacking. Hybrid systems research offers to provide exactly the insights needed in this area; hence, vision and control can be seen as an area pushing the boundaries of our theoretical understanding. Conversely, vision has long been an area where control logic is developed largely heuristically. In particular, focus of attention, long driven by observations of human performance, has lacked a widely accepted concrete basis. Successful development of a theory of "attention" based on control concepts would constitute a significant achievement in the field of vision. Finally, vision has historically been devoted to static image analysis, and the potential for new discoveries based on a dynamical approach to vision problems seems promising.
The caveats to placing strong boundaries around a "field" of vision and control stem largely from the danger of the field closing itself off from obvious and advantageous connections to the larger areas of vision research and control research. As noted in the summaries, vision must often be used in the context of other sensing modalities: force, tactile, GPS and so forth. Focusing solely on vision and control carries the danger of making problems "harder than they are" through myopia. The second observation is that, in general, advances in the field of vision and control inevitably lead toward the ability to construct more general vision systems. Hence, as S. Zucker pointed out, vision and control must continue to look to "general" vision science for ideas and inspiration.
In summary, the outlook for vision and control is bright. Applications abound, technology continues to make such systems more cost-effective and simpler to build, and interesting theoretical questions beckon. By all accounts, we can look forward to the next millennium as an era when vision-based systems slowly become the rule rather than the exception.
Acknowledgement. Greg Hager was supported by NSF IRI-9420982, NSF IRI-9714967 and ARO-DAAG55-98-1-0168. David Kriegman was supported under an NSF Young Investigator Award IRI-9257990 and ARO-DAAG55-98-1-0168. A.S. Morse was supported by the NSF, AFOSR and ARO.