Foreword
As robots are used to perform increasingly difficult tasks in complex and unstructured environments, it becomes more of a challenge to reliably program their behavior. There are many sources of variation for a robot control program to contend with, both in the environment and in the performance characteristics of the robot hardware and sensors. Because these are often too poorly understood or too complex to be adequately managed with static behaviors hand-coded by a programmer, robotic systems sometimes have undesirable limitations in their robustness, efficiency, or competence. Machine learning techniques offer an attractive and innovative way to overcome these limitations. In principle, machine learning techniques can allow a robot to successfully adapt its behavior in response to changing circumstances without the intervention of a human programmer. The exciting potential of robot learning has only recently been investigated in a handful of groundbreaking experiments. Considerably more work must be done to progress beyond simple research prototypes to techniques that have practical significance. In particular, it is clear that robot learning will not be a viable practical option for engineers until systematic procedures are available for designing, developing, and testing robotic systems that use machine learning techniques. This book is an important first step toward that goal.

Marco Dorigo and Marco Colombetti are two of the pioneers making significant contributions to a growing body of research in robot learning. Both authors have years of hands-on experience developing control programs for a variety of real robots and simulated artificial agents. They have also made significant technical contributions in the subfield of machine learning known as "reinforcement learning". Their considerable knowledge about both robotics and machine learning, together
with their sharp engineering instincts for finding reliable and practical solutions to problems, has produced several successful implementations of learning in robots. In the process, the authors have developed valuable insights about the methodology behind their success. Their insights, ideas, and experiences are summarized clearly and comprehensively in the chapters that follow. Robot Shaping: An Experiment in Behavior Engineering provides useful guidance for anyone interested in understanding how to specify, develop, and test a robotic system that uses machine learning techniques. The behavior analysis and training methodology advocated by the authors is a well-conceived procedure for robot development based on three key ideas: analysis of behavior into modular components; integration of machine learning considerations into the basic robot design; and specification of a training strategy involving step-by-step reinforcement (or "shaping"). While more work remains to be done in areas like behavior assessment, the elements of the framework described here provide a solid foundation for others to use and build on. Additionally, the authors have tackled some of the open technical issues related to reinforcement learning. The family of techniques they use, referred to as "classifier systems", typically requires considerable expertise and experience to implement properly. The authors have devised some pragmatic ways to make these techniques both easier to use and more reliable.

Overall, this book is a significant addition to the robotics and machine learning literature. It is the first attempt to specify an orderly "behavior engineering" procedure for learning robots. One of the goals the authors set for themselves was to help engineers develop high-quality robotic systems. I think this book will be instrumental in providing the guidance engineers need to build such systems.
Lashon B. Booker
The MITRE Corporation
Preface
This book is about designing and building learning autonomous robots. An autonomous robot is a remarkable example of a device that is difficult to design and program because it must carry out its task in an environment at least partially unknown and unpredictable, with which it interacts through noisy sensors and actuators. For example, a robot might have to move in an environment with unknown topology, trying to avoid still objects, people walking around, and so forth; to sense its surroundings, it might have to rely on sonars and dead reckoning, which are notoriously inaccurate. Moreover, the actual effect of its actions might be very difficult to describe a priori due to the complex interactions of the robot's effectors with its physical environment. There is a fairly widespread opinion that a good way of solving such problems would be to endow autonomous robots with the ability to acquire knowledge from experience, that is, from direct interaction with their environments. This book explores the construction of a learning robot that acquires knowledge through reinforcement learning, with reinforcements provided by a trainer that observes the robot and evaluates its performance.

One of our goals is to clarify what the two terms in our title, robot shaping and behavior engineering, mean and how they relate to our main research aim: designing and building learning autonomous robots. Robot shaping uses learning to translate suggestions from an external trainer into an effective control strategy that allows a robot to achieve a goal. We borrowed "shaping" from experimental psychology because training an artificial robot closely resembles what experimental psychologists do when they train an experimental subject to produce a predefined response. Our approach differs from most current research on learning autonomous robots in one important respect: the trainer plays a fundamental role in
the robot learning process. Most of this book is aimed at showing how to use a trainer to develop control systems for simulated and real robots.

We propose the term behavior engineering to characterize a new technological discipline, whose objective is to provide techniques, methodologies, and tools for developing autonomous robots. In chapter 7 we describe one such behavior engineering methodology, "behavior analysis and training" or BAT. Although discussed only at the end of the volume, the BAT methodology permeates all our research on behavior engineering and, as will be clear from reading this book, is a direct offspring of our robot-shaping approach.

The book is divided into eight chapters. Chapter 1 serves as an introduction to the research presented in chapters 2 through 7 and focuses on the interplay between learning agents, the environments in which the learning agents live, and the trainers with which they interact during learning. Chapter 2 provides background on ALECSYS, the learning tool we have used in our experimental work. We first introduce LCSo, a learning classifier system strongly inspired by Holland's seminal work, and then show how we improved this system by adding new functionalities and by parallelizing it on a transputer. Chapter 3 discusses architectural choices and shaping policies as a function of the structure of behavior in building a successful learning system. Chapter 4 presents the results of experiments carried out in many different simulated environments, while chapter 5 presents the practical results of our robot-shaping approach in developing the control system of a real robot. Extending our model to nonreactive tasks, in particular to sequential behavior patterns, chapter 6 shows that giving the learning system a very simple internal state allows it to learn a dynamic behavior; chapter 7 describes BAT, a structured methodology for the development of autonomous systems that learn through interaction with a trainer. Finally, chapter 8 discusses related work, draws some conclusions, and gives hints about directions the research presented in this book may take in the near future.
Acknowledgments
Our work was carried out at the Artificial Intelligence and Robotics Project of Politecnico di Milano (PM-AI&R Project). We are grateful to all members of the project for their support, and in particular, to Marco Somalvico, the project's founder and director; to Andrea Bonarini, who in addition to research and teaching was responsible for most of the local organization work; and to Giuseppe Borghi, Vincenzo Caglioti, Giuseppina Gini, and Domenico Sorrenti, for their invaluable work in the robotics lab.

The PM-AI&R Project has been an ideal environment for our research. Besides enjoying many stimulating discussions with our colleagues, we made fullest use of the laboratory facilities for our experiments on real robots. Giuseppe Borghi had an active role in running the HAMSTER and CRAB experiments; for the CRAB experiments presented in chapter 7, we could rely on the cooperation of Mukesh Patel, who spent a year at the PM-AI&R Project on an ERCIM Research Fellowship funded by the EC Human Capital and Mobility Programme. Emanuela Prato Previde, of the Institute of Psychology, Faculty of Medicine, Università di Milano, discussed with us several conceptual issues connected with experimental psychology.

Our work has received the financial support of a number of Italian and European institutions. We gratefully acknowledge a "60%" grant from MURST (Italian Ministry for University and Scientific and Technological Research) to Marco Colombetti for the years 1992-1994; a "University Fund" grant from Politecnico di Milano to Marco Colombetti for the year 1995; and an Individual EC Human Capital and Mobility Programme Fellowship to Marco Dorigo for the years 1994-1996.

Many students have contributed to our research by developing the software, building some of the robots, and running the experiments. We
wish to thank them all. In particular, Enrico Sirtori, Stefano Michi, Roberto Pellagatti, Roberto Piroddi, and Rino Rusconi contributed to the development of ALECSYS and to the experiments presented in chapters 4 and 5. Sergio Barbesta, Jacopo Finocchi, and Maurizio Goria contributed to the experiments presented in chapter 6; Franco Dorigo and Andrea Maesani to the experiments presented in sections 5.3 and 7.3; Enrico Radice to the experiments presented in section 7.4; Massimo Papetti to the experiments presented in section 7.5. Autono Mouse II was designed and built by Graziano Ravizza of Logic Brainstorm; Autono Mouse IV and Autono Mouse V were designed and built by Franco Dorigo.

Marco Dorigo is grateful to all his colleagues at the IRIDIA lab of Université Libre de Bruxelles for providing a relaxed yet highly stimulating working environment. In particular, he wishes to thank Philippe Smets and Hugues Bersini for inviting him to join their group, and Hugues Bersini, Gianluca Bontempi, Henri Darquenne, Christine Decaestecker, Christine Defrise, Gianni Di Caro, Vittorio Gorrini, Robert Kennes, Bruno Marchal, Philip Miller, Marco Saerens, Alessandro Saffiotti, Tristan Salome, Philippe Smets, Thierry Van de Merckt, and Hong Xu for their part in the many interesting discussions he and they had on robotics, learning, and other aspects of life.

Some of the ideas presented in this book originated while Marco Dorigo was a graduate student at Politecnico di Milano. Many people helped make that period enjoyable and fruitful. Marco Dorigo thanks Alberto Colorni for his advice and support in writing a good dissertation, Francesco Maffioli and Nello Scarabottolo for their thoughtful comments on combinatorial optimization and on parallel architectures, and Stefano Crespi Reghizzi for his work and enthusiasm as coordinator of the graduate study program.

Finally, a big thank-you to our families, and to our wives, Emanuela and Laura, in particular. This book is dedicated to them, and to Luca, the newborn son of Laura and Marco Dorigo.
Chapter 1
Shaping Robots
1.1 INTRODUCTION
This book is about our research on the development of learning robotic agents at the Artificial Intelligence and Robotics Project of the Politecnico di Milano. Our long-term research goal is to develop autonomous robotic agents capable of complex behavior in a natural environment. The results reported in the book can only be viewed as a first step in this direction, although, we hope, a significant one.

Although the potential impact of advances in autonomous agents on robotics and automation speaks for itself, the development of autonomous robots is a particularly demanding task, involving highly sophisticated mechanics, data collection and processing, control, and the like. Indeed, we believe there is need for a unified discipline of agent development, which we call "behavior engineering"; we hope this book will provide a contribution in this direction.

The term behavior engineering is reminiscent of similar, well-established terms like software engineering and knowledge engineering. By introducing a new term, we want to stress the specificity of the problems involved in the development of autonomous robots, and the close relationship between the development of artificial agents and the natural sciences' study of the behavior of organisms. Although the relevance of the notion of behavior has been explicitly recognized in robotics (see, for example, Arkin 1990; Brooks 1990, 1991; Brooks and Maes 1994; Maes 1990; Steels 1990, 1994; Dorigo and Schnepf 1993; Saffiotti, Konolige, and Ruspini 1993; and Saffiotti, Konolige, and Ruspini 1995), we believe that the links between robotics and the behavioral sciences have not yet been fully appreciated.
It can be said that the mechanisms of learning, and the complex balance between what is learned and what is genetically determined, are the main concern of the behavioral sciences. In our opinion, a similar situation arises in autonomous robotics, where a major problem is in deciding what should be explicitly designed and what should be left for the robot to learn from experience. Because it may be very difficult, if not impossible, for a human designer to incorporate enough world knowledge into an agent from the very beginning, we believe that machine learning techniques will play a central role in the development of robotic agents. Agents can reach a high performance level only if they are capable of extracting useful information from their "experience", that is, from the history of their interaction with the environment. However, the development of interesting behavior patterns is not merely a matter of agent self-organization. Robotic agents will also have to perform tasks that are useful to us. Agents' behavior will thus have to be properly "shaped" so that a predefined task is carried out. Borrowing a term from experimental psychology (Skinner 1938), we call such a process "robot shaping". In this book we advocate an approach to robot shaping based on reinforcement learning.

In addition to satisfying intellectual curiosity about the power of machine learning algorithms to generate interesting behavior patterns, autonomous robots are an ideal test bed for several machine learning techniques. Many journals have recently dedicated special issues to the application of machine learning techniques to the control of robots (Gaussier 1995; Krose 1995; Dorigo 1996; Franklin, Mitchell, and Thrun 1996). Moreover, the use of machine learning methods opens up a new range of possibilities for the implementation of robot controllers. Architectures such as neural networks and learning classifier systems, for example, cannot practically be programmed by hand: if we want to use them, we have to rely on learning. And there are good reasons why we might want to use them: such architectures have properties like robustness, reliability, and the ability to generalize that can greatly contribute to the overall quality of the agent.

Even when learning is not strictly necessary, it is often suggested that an appropriate use of learning methods might reduce the production cost of robots. In our opinion, this statement has yet to be backed with sufficient evidence; indeed, nontrivial learning systems are not easy to develop and use, and the computational complexity of learning can be very high.
On the other hand, it seems to us that the reduction of production costs should not be sought too obsessively. Recent developments in applied computer science, in particular software engineering (see, for example, Sommerville 1989; Ghezzi, Jazayeri, and Mandrioli 1991), have shown that a reduction of the global cost of a product, computed on its whole life cycle, cannot be obtained by reducing the costs of the development phase. On the contrary, low global costs demand high product quality, which can only be achieved through a more complex and expensive development process. The use of learning techniques should be viewed as a means to achieve higher behavioral quality for an autonomous agent; the usefulness of the learning approach should therefore be evaluated in a wider context, taking into account all relevant facets of agent quality. We believe this to be a strong argument in favor of behavior engineering, that is, an integrated treatment of all aspects of agent development, including design and the use of learning techniques.
1.2 LEARNING TO BEHAVE
The term learning is far too generic; we therefore need to delimit the kind of learning models we have in mind. From the point of view of a robot developer, an important difference between learning models can be found in the kind and complexity of input information specifically required to guide the learning process. In these terms, the two extreme approaches are given by supervised learning and by self-organizing systems. While the latter do not use any specific feedback information to implement a learning process, the former require the definition of a training set, that is, a set of labeled input-output pairs. Unfortunately, in a realistic application the construction of a training set can be a very demanding task, and much effort can be wasted in labeling sensory configurations that are very unlikely to occur. On the other hand, self-organization has a limited range of application. It can only be used to discover structural patterns already existing in the environment; it cannot be used to learn arbitrary behaviors.

In terms of information requirements, an intermediate case is represented by reinforcement learning (RL). RL can be seen as a class of problems in which an agent learns by trial and error to optimize a function (often a discounted sum) of a scalar called "reinforcement" (Kaelbling, Littman, and Moore 1996). Therefore, algorithms for solving RL problems do not require that a complete training set be specified: they only
need a scalar evaluation of behavior (that is, a way to compute reinforcements) in order to guide learning. Such an evaluation can either be produced by the agent itself as the result of its interaction with the environment or be provided by a trainer, that is, an external observer capable of judging how well the agent approximates the desired behavior (Dorigo and Colombetti 1994a, 1994b; Dorigo 1995).

In principle, of course, it would be preferable to do without an external trainer: an artificial system able to learn an appropriate behavior by itself is the dream of any developer. However, the use of a strictly internal reinforcement procedure poses severe limitations. A robot can evaluate its own behavior only when certain specific events take place, thus signaling that a desirable or undesirable state has been reached. For example, a positive reinforcement, or reward, can be produced once the robot's sensors detect that a goal destination has been reached, and a negative reinforcement, or punishment, can be generated once an obstacle has been hit. Such delayed reinforcements have then to be used to evaluate the suitability of the robot's past course of action. While this process is theoretically possible, it tends to be unacceptably inefficient: feedback information is often too rare and episodic for an effective learning process to take place in realistic robotic applications.

A way to bypass this problem is to use a trainer to continuously monitor the robot's behavior and provide immediate reinforcements. To produce an immediate reinforcement, the trainer must be able to judge how well each single robot move fits into the desired behavior pattern. For example, the trainer might have to evaluate whether a specific movement brings the robot closer to its goal destination or keeps the robot far enough from an obstacle. Using a human trainer would amount to a direct translation of the high-level, often implicit knowledge a human being has about problem solving into a robotic control program, avoiding the problems raised by the need to make all the intermediate steps explicit in the design of the robotic controller. Unforeseen problems could arise, however. Human evaluations could be too noisy, inaccurate, or inconsistent to allow the learning algorithm to converge in reasonable time. Also, the reaction time of humans is orders of magnitude slower than that of simulated robots, which could significantly increase the time required to achieve good performance when training is carried out in simulation. On the other hand, this problem will probably not emerge with real robots, whose reaction time is much slower due to the dynamics of their mechanical parts.
More research is necessary to understand whether the human trainer approach is a viable one. For the time being, we limit our attention to robots that learn by exploiting the reinforcements provided by a reinforcement program (RP), that is, by an artificial trainer implemented as a computer program.

The use of an RP might appear as a bad idea from the point of view of our initial goals. After all, we wanted to free the designer of the need to completely specify the robot's task. If the designer has to implement a detailed RP, why should this be preferable to directly programming the robot's controller? One reason is of an experimental nature. Making our robots learn by using an RP is a first, and simpler, step in the direction of using human beings: a success with RPs opens up the door to further research in using human-generated reinforcements. But even if we limit our attention to RPs, there are good motivations for the RP approach, as opposed to directly programming the robot controller. Writing an RP can be an easier task than programming a controller because the information needed in the former case is more abstract.

To understand why this is so, let us restrict our attention to reactive agents, that is, to agents whose actions are a function solely of the current sensory input. The task of a reactive controller is to produce a control signal associated with the signals coming from sensors; the control message is sent to the robot effectors, which in turn produce a physical effect by interacting with the environment. Therefore, to define a controller, the designer must know the exact format and meaning of the messages produced by the sensors and required by the effectors. Moreover, the designer has to rely on the assumption that in all situations the physical sensors and effectors will be consistently correct.

Suppose, for example, that we want a mobile robot with two independent wheels to move straight forward. The designer will then specify that both wheels must turn at the same angular velocity. But suppose that for some reason one wheel happens to be somewhat smaller than the other. In this situation, for the robot to move straight on, the two wheels must turn at different velocities. Defining a reactive controller for a real robot (not an ideal one) thus appears to be a demanding task: it requires that the designer know all the details of the real robot, which often differ slightly from the ideal robot that served as a model.

Now consider the definition of an RP to train the robot to move straight forward. In this case, the designer might specify a reinforcement related to the direction in which the agent is moving, so that the highest
evaluation is produced when such a direction is straight ahead. There is no need to worry about smaller wheels and similar problems: the system will learn to take the control action, whichever it may be, that produces the desired result by taking into account the actual interaction between the robot and its environment. As often happens, the same problem could be solved in a different way. For example, the designer might design a feedback control system that makes the robot move straight forward even with unequal wheels. However, this implies that the designer is aware of this specific problem in order to implement a specific solution. By contrast, RL solves the same problem as a side effect of the learning process and does not require the designer even to be aware of it.
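To make the contrast concrete, the following is a minimal sketch of what a reinforcement program for the "move straight forward" task might look like. The sketch is only illustrative: the measured quantity, the threshold, and the reinforcement values are hypothetical and are not taken from the experiments described in this book.

```python
def reinforcement_program(heading_change_deg):
    """Hypothetical RP for the "move straight forward" task.

    heading_change_deg: change of heading (in degrees) produced by the
    robot's last move, as observed by the trainer.  Returns a scalar
    reinforcement: a reward for keeping a straight course, a punishment
    for turning.
    """
    if abs(heading_change_deg) < 5.0:   # the course is (almost) straight
        return 1.0                      # reward
    return -1.0                         # punishment
```

Note that the RP refers only to the observed effect of a move (the change of heading), not to wheel velocities; a smaller wheel is therefore compensated for automatically by the learning process, exactly as argued above.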
1.3 SHAPING AN AGENT'S BEHAVIOR

As we have pointed out in the previous section, we view learning mechanisms as a way of providing the agent with the capacity to be trained. Indeed, we have found that effective training is an important issue per se, both conceptually and practically.

First, let us analyze the relationship between the RP and the final controller built by the agent through learning. A way of directly programming a controller, without resorting to learning, would be to describe the robot's behavior in some high-level task-oriented language. In such a language, we would like to be able to say things like "Reach position P" or "Grasp object O". However, similar statements cannot be directly translated into the controller's machine code unless the translator can rely on a world model, which represents both general knowledge about the physical world and specific knowledge about the particular environment the robot acts in (figure 1.1). Therefore, world modeling is a central issue for task-based robot programming.

Unfortunately, everybody agrees that world modeling is not easy, and some people even doubt that it is possible. For example, the difficulty of world modeling has motivated Brooks to radically reject the use of symbolic representations, with the argument that "the world is its own best model" (Brooks 1990, 13). Of course, this statement should not be taken too literally (a town may be its own best map, but we do find smaller paper maps useful after all). The point is that much can be achieved without a world model, or at least without a human-designed world model. While there is, in our opinion, no a priori reason why symbolic models should be rejected out of hand, there is a problem with the models that
Figure 1.1 Translating high-level description of behavior into control program
human designers can provide. Such models are biased for at least two reasons: (1) they tend to reflect the worldview determined by the human sensory system; and (2) they inherit the structure of the linguistic descriptions used to formulate them off-line. An autonomous agent, on the other hand, should build its own worldview, based on its sensorimotor apparatus and cognitive abilities (which might not include anything resembling a linguistic ability).

One possible solution is to use a machine learning system. The idea is to let the agent organize its own behavior starting from its own "experience", that is, the history of its interaction with the environment. Again, we start from a high-level description of the desired behavior, from which we build the RP. In other words, the RP is an implicit procedural representation of the desired behavior. The learning process can now be seen as the process of translating such a representation into a low-level control program: a process, however, that does not rely on a world model and that is carried out through continuous interaction between agent and environment (figure 1.2). In a sense, then, behavioral learning with a trainer is a form of grounded translation of a high-level specification of behavior into a control program.

If we are interested in training an agent to perform a complex behavior, we should look for the most effective way of doing so. In our experience, a good choice is to impose some degree of modularity on the training process. Instead of trying to teach the complete behavior at once, it is better to split the global behavior into components, to train each component separately, and then to train the agent to correctly coordinate the component behaviors. This procedure is reminiscent of the shaping procedure that psychologists use in their laboratories when they train an animal to produce a predefined response.
Figure 1.2 Learning as grounded translation
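The modular training procedure just described (train each component behavior separately, then train their coordination) can be summarized as a two-phase schedule. The sketch below is only illustrative: the agent and trainer interfaces and the behavior names are hypothetical and do not reproduce the training code actually used in our experiments.

```python
def shape(agent, trainer, basic_behaviors, steps_per_phase=10000):
    """Illustrative incremental-shaping schedule (hypothetical API)."""
    # Phase 1: train each component behavior in isolation, each with its
    # own reinforcement program.
    for behavior in basic_behaviors:          # e.g. ["chase", "avoid"]
        for _ in range(steps_per_phase):
            action = agent.act(behavior)
            reward = trainer.reinforce(behavior, action)
            agent.learn(behavior, reward)
    # Phase 2: train the coordination of the component behaviors,
    # reinforcing with respect to the global task.
    for _ in range(steps_per_phase):
        action = agent.act("coordinator")
        reward = trainer.reinforce("global_task", action)
        agent.learn("coordinator", reward)
```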
1.4 REINFORCEMENT LEARNING, EVOLUTIONARY COMPUTATION, AND LEARNING CLASSIFIER SYSTEMS

As we said, reinforcement learning can be defined as the problem faced by an agent that learns by trial and error how to act in a given environment, receiving as the only source of learning information a scalar feedback known as "reinforcement". Recently there has been much research on algorithms that lend themselves to solving this type of problem efficiently. Typically, the result of the agent learning activity is a policy, that is, a mapping from states to actions that maximizes some function of the reward intake (where rewards are positive reinforcements, while punishments are negative reinforcements). The study of algorithms that find optimal policies with respect to some objective function is the object of optimal control. Techniques to solve RL problems are part of "adaptive optimal control" (Sutton, Barto, and Williams 1991) and can take inspiration from the existing literature (see Kaelbling, Littman, and Moore 1996). Indeed, an entire class of algorithms used to solve RL problems modeled as Markov decision processes (Bertsekas 1987; Puterman 1994) took inspiration from dynamic programming (Bellman 1957; Ross 1983). To these techniques belong, among others, the "adaptive heuristic critic" (Barto, Sutton, and Anderson 1983) and
"Q-learning" (Watkins 1989), which try to learn an optimal policy without learning a model of the environment, and "Dyna" (Sutton 1990, 1991) and "real-time dynamic programming" (Barto, Bradtke, and Singh 1995), which try to learn an optimal policy while learning a model of the environment at the same time.

All the previously cited approaches try to learn a value function. Informally, this means that they try to associate a value to each state or to each state-action pair such that it can be used to implement a control policy. The transition from value functions to control policies is typically very easy to implement. For example, if the value function is associated to state-action pairs, a control policy will be defined by choosing in every state the action with the highest value.

Another class of techniques tries to solve RL problems by searching the space of policies instead of the space of value functions. In this case, the object of learning is the entire policy, which is treated as an atomic object. Examples of these methods are evolutionary algorithms to develop neural network controllers (Dress 1987; Fogel, Fogel, and Porto 1990; Beer and Gallagher 1992; Jakobi, Husbands, and Harvey 1995; Floreano and Mondada 1996; Moriarty and Miikkulainen 1996; and Tani 1996), or to develop computer programs (Cramer 1985; Dickmanns, Schmidhuber, and Winklhofer 1987; Hicklin 1986; Fujiki and Dickinson 1987; Grefenstette, Ramsey, and Schultz 1990; Koza 1992; Grefenstette and Schultz 1994; and Nordin and Banzhaf 1996).

In a learning classifier system (LCS), the machine learning technique that we have investigated and that we have used to run all the experiments presented in this book, the learned controller consists of a set of rules (called "classifiers") that associate control actions with sensory input in order to implement the desired behavior. The learning classifier system was originally proposed within a context very different from optimal control theory. It was intended to build rule-based systems with adaptive capabilities in order to overcome the brittleness shown by traditional handmade expert systems (Holland and Reitman 1978; Holland 1986). Nevertheless, it has recently been shown that by operating a strong simplification on an LCS, one obtains a system equivalent to Q-learning (Dorigo and Bersini 1994). This makes the connection between LCSs and adaptive optimal control much tighter.

The LCS learns the usefulness ("strength") of classifiers by exploiting the "bucket brigade" algorithm, a temporal difference technique (Sutton
1988) strongly related to Q-learning. Also, the LCS learns new classifiers by using a genetic algorithm. Therefore, in an LCS there are aspects of both the approaches discussed above: it learns a value function (the strength of the classifiers), and at the same time it searches the space of possible rules by exploiting an evolutionary algorithm. A detailed description of our implementation of an LCS is given in chapter 2.
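In the terminology just introduced, the quantities being learned can be stated compactly. The two formulas below are the standard ones from the reinforcement learning literature cited above, not something specific to our system: the discounted sum of reinforcements that a policy tries to maximize, and the Q-learning update of the value of a state-action pair.

```latex
% Discounted return collected from time t onward (gamma is the discount factor):
R_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k}, \qquad 0 \le \gamma < 1 .

% Q-learning update after taking action a in state s, receiving reinforcement r,
% and observing the next state s' (alpha is the learning rate):
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] .
```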
1.5 THE AGENTS AND THEIR ENVIRONMENTS
Our work has been influenced by Wilson's "Animat problem" (1985, 1987), that is, the problem of realizing an artificial system able to adapt and survive in a natural environment. This means that we are interested in behavioral patterns that are the artificial counterparts of basic natural responses, like feeding and escaping from predators. Our experiments are therefore to be seen as possible solutions to fragments of the Animat problem.

Behavior is a product of the interaction between an agent and its environment. The set of possible behavioral patterns is therefore determined by the structure and the dynamics of both the agent and the environment, and by the interface between the two, that is, the agent's sensorimotor apparatus. In this section, we briefly introduce the agents, the behavior patterns, and the environments we shall use in our experiments.

1.5.1 The Agents' "Bodies"

Our typical artificial agent, the Autono Mouse, is a small moving robot. Two examples of Autono Mice, which we call "Autono Mouse II" and "Autono Mouse IV", are shown in figures 1.3 and 1.4. The simulated Autono Mice used in the experiments presented in chapters 4 and 6 are, unless otherwise stated, the models of their physical counterparts. We also use Autono Mouse V, a slight variation of Autono Mouse IV, which will be described in chapter 7.

Besides using the homemade Autono Mice, we have also experimented with two agents based on commercial robots, which we call "HAMSTER" and "CRAB". HAMSTER (figure 1.5) is a mobile robot based on Robuter, a commercial platform produced by RoboSoft. CRAB (figure 1.6) is a robotic arm based on a two-link industrial manipulator, an IBM 7547 with a SCARA geometry. More details on the robots and their environments will be given in the relevant chapters.
Figure 1.3 Autono Mouse II
1.5.2 The Agents' "Mind"

The Autono Mice are connected to ALECSYS (A LEarning Classifier SYStem), a learning classifier system implemented on a network of transputers (Dorigo and Sirtori 1991; Colombetti and Dorigo 1993; Dorigo 1995), which will be presented in detail in chapter 2. ALECSYS allows one to design the architecture of the agent controller as a distributed system. Each component of the controller, which we call a "behavioral module", can be connected to other components and to the sensorimotor interface.

As a learning classifier system exploiting a genetic algorithm, ALECSYS is based on the metaphor of biological evolution. This raises the question of whether evolution theory provides the right technical language to characterize the learning process. At the present stage of the research, we are inclined to think it does not.

There are various reasons why the language of evolution cannot literally apply to our agents. First, we use an evolutionary mechanism to implement individual learning rather than phylogenetic evolution. Second, the
Figure 1.4 Autono Mouse IV
distinction between phenotype and genotype, essential in evolution theory, is in our case rather confused: individual rules within a learning classifier system play both the role of a single chromosome and of the phenotype undergoing natural selection. In our experiments, we found that we tend to consider the learning system as a black box, able to produce stimulus-response (S-R) associations and categorizations of stimuli into relevant equivalence classes. More precisely, we expect the learning system to

- discover useful associations between sensory input and responses; and
- categorize input stimuli so that precisely those categories will emerge which are relevantly associated to responses.

Given these assumptions, the sole preoccupation of the designer is that the interactions between the agent and the environment can produce enough relevant information for the target behavior to emerge. As will appear from the experiments reported in the following chapters, this concern influences the design of the artificial environment's objects and of the agent's sensory interface.
Figure 1.5 HAMSTER
Figure 1.6 CRAB
1.5.3 Types of Behavior
A first, rough classification allows one to distinguish between stimulus-response (S-R) behavior, that is, reactive responses connecting sensors to effectors in a direct way, and dynamic behavior, requiring some kind of internal state to mediate between input and output.

Most of our experiments (chapters 4 and 5) are dedicated to S-R behavior, although in chapter 6 we investigate some forms of dynamic behavior. To go beyond simple S-R behavior implies that the agent is endowed with some form of internal state. The most obvious candidate for an internal state is a memory of the agent's past experience (Lin and Mitchell 1992; Cliff and Ross 1994; Whitehead and Lin 1995). The designer has to decide what has to be remembered, how to remember it, and for how long. Such decisions cannot be taken without a prior understanding of relevant properties of the environment.

In an experiment reported in section 4.2.3, we added a sensor memory, that is, a memory of the past state of the agent's sensors, allowing the learning system to exploit regularities of the environment. We found that a memory of past perceptions first makes the learning process harder, but eventually increases the performance of the learned behavior. By running a number of such experiments, we confirmed an obvious expectation, namely, that the memory of past perceptions is useful only if the relationship between the agent and its environment changes slowly enough to preserve a high correlation between subsequent states. In other words, agents with memory are "fitter" only in reasonably predictable environments.

In a second set of experiments, discussed at length in chapter 6, we improve our learning system's capabilities by adding a module to maintain a state word, that is, an internal state. We show that, by the use of the state word, we are able to make our Autono Mice learn sequential behavior patterns, that is, behaviors in which the decision of what action to perform at time t is influenced by the actions performed in the past.

We believe that experiments on robotic agents must be carried out in the real world to be truly significant, although such experiments are in general costly and time-consuming. It is therefore advisable to preselect a small number of potentially relevant experiments to be performed in the real world. To carry out the selection, we use a simulated environment, which allows us to have accurate expectations on the behavior of the real agent and to prune the set of possible experiments.

One of the hypotheses we want to explore is that relatively complex behavioral patterns can be built bottom-up from a set of simple responses.
This hypothesis has already been put to test in robotics, for example, by Arkin (1990) and by Saffiotti, Konolige, and Ruspini (1995). Arkin's "Autonomous Robot Architecture" integrates different kinds of information (perceptual data, behavioral schemes, and world knowledge) in order to get a robot to act in a complex natural environment. The robot generates complex responses, like walking through a doorway, as a combination of competing simpler responses, like moving ahead and avoiding a static obstacle (the wall, in Arkin's doorway example). The compositional approach to building complex behavior has been given a formal justification in the framework of multivalued logics by Saffiotti, Konolige, and Ruspini. The key point is that complex behavior can demonstrably emerge from the simultaneous production of simpler responses.

We have considered four kinds of basic responses:

1. Approaching behavior, that is, getting closer to an almost still object with given features; in the natural world, this response is a fundamental component of feeding and sexual behavior.
2. Chasing behavior, that is, following and trying to catch a still or moving object with given features; like the approaching behavior, this response is important for feeding and reproduction.
3. Avoidance behavior, that is, avoiding physical contact with an object of a given kind; this can be seen as the artificial counterpart of a behavioral pattern that allows an organism to avoid objects that can hurt it.
4. Escaping behavior, that is, moving as far as possible from an object with given features; the object can be viewed as a predator.

As we shall see, more complex behavioral patterns can be built from these simple responses in many different ways.
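As an illustration of one such composition, the sketch below combines the four basic responses by letting escaping suppress all the others and avoidance modulate the movement produced by approaching or chasing, a policy discussed further in the next subsection. The sketch is a hand-written arbiter with hypothetical sensor fields; it is not the coordination mechanism used in our experiments, where the coordination itself is learned.

```python
import math

def _vector(angle, magnitude=1.0):
    """Steering vector (dx, dy) in the robot's own frame of reference."""
    return (magnitude * math.cos(angle), magnitude * math.sin(angle))

def compose_responses(percept):
    """Hypothetical arbiter for the four basic responses.

    percept is assumed to expose, for each relevant object, a visibility
    flag and a relative angle (in radians) with respect to the agent.
    """
    if percept.predator_visible:                  # escaping suppresses the rest
        return _vector(percept.predator_angle + math.pi)
    move = (0.0, 0.0)
    if percept.prey_visible:                      # approaching / chasing
        move = _vector(percept.prey_angle)
    if percept.obstacle_visible:                  # avoidance modulates the move
        ax, ay = _vector(percept.obstacle_angle + math.pi, 0.5)
        move = (move[0] + ax, move[1] + ay)
    return move
```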
1.5.4 The Environment

Our robots are presently able to act in closed spaces, with smooth floors and constant lighting, where they interact with moving light sources, fixed obstacles, and sounds. Of course, we could fantasize freely in simulations, by introducing virtual sensors able to detect the desired entities, but then results would not be valid for real experimentation. We therefore prefer to adapt our goals to the actual capacities of the agents.

The environment is not interesting per se, but for the kind of interactions that can take place between it and the agent. Consider the four basic responses introduced in the previous subsection. We can call them "objectual" in that they involve the agent's relationship with an external object.
Objectual responses are

- type-sensitive, in that agent-object interactions are sensitive to the type to which the object belongs (prey, obstacle, predator, etc.); and
- location-sensitive, in that agent-object interactions are sensitive to the relative location of the object with respect to the agent.
Type sensitivity is interesting because it allows for fairly complex patterns of interaction, which are, however, within the capacity of an S-R agent. It requires only that the agent be able to discriminate some object feature characteristic of the type. Clearly, the types of objects an S-R agent can tell apart depend on the physical interactions between external objects and the agent's sensory apparatus. Note that an S-R agent is not able to identify an object, that is, discern two similar but distinct objects of the same type.

The interactions we consider do not depend on the absolute location of the objects and of the agent; they depend only on the relative angular position, and sometimes on the relative distance, of the object with respect to the agent. Again, this requirement is within the capacities of an S-R agent.

In the context of shaping, differences that appear to an external observer can be relevant even if they are not perceived by the agent. The reason is that the trainer will in general base his reinforcing activity on the observation of the agent's interaction with the environment. Clearly, from the point of view of the agent, a single move of the avoidance or the escaping behavior would be exactly the same, although in complex behavior patterns, avoidance and escaping relate differently to other behaviors. In general, avoidance should modulate some other movement response, whereas escaping will be more successful if it suppresses all competing responses. As we shall see in the following chapters, this fact is going to influence both the architectural design and the shaping policy for the agent.

For learning to be successful, the environment must have a number of properties. Given the kind of agent we have in mind, the interaction of a physical object with the agent depends only on the object's type and on its relative position with respect to the agent. Therefore, sufficient information about object types and relative positions must be available to the agent. This problem can be solved in two ways: either the natural objects existing in the environment have sufficient distinctive features that allow them to be identified and located by the agent, or the artificial objects must
be designed so that they can be identified and located. For example, if we want the agent to chase light L1 and avoid light L2, the two lights must be of different color, or have a different polarization plane, to be distinguished by appropriate sensors. In any case, identification will be possible only if the rest of the environment cooperates. For example, if light sensing is involved, environmental lighting should not vary too much during the agent's life.

In order for a suitable response to depend on an object's position, objects must be still, or move slowly enough with respect to the agent's speed. This does not mean that a sufficiently smart agent could not evolve a successful interaction pattern with very fast objects, only that such a pattern could not depend on the instantaneous relative position of the object, but would involve some kind of extrapolation of the object's trajectory, which is beyond the present capacities of ALECSYS.
1.6 BEHAVIOR ENGINEERING AS AN EMPIRICAL ENDEAVOR

We regard our own work as part of artificial intelligence (AI). Although certainly departing from work in classical, symbolic AI, it is akin to research in the "new wave" of AI that started in the early 1980s and is focused on the development of simple but complete agents acting in the real world.

In classical AI, computer programs have been regarded sometimes as formal theories and sometimes as experiments. We think that both views are defective. A computer program certainly is a formal object, but it lacks the degree of conciseness, readability, and elegance required from a theory. Moreover, programs usually blend together high-level aspects of theoretical relevance with low-level elements that are there only to make them perform at a reasonable level. But a program is not by itself an experiment; at most, it is the starting point to carry out experiments (see, for example, Cohen 1995). Borrowing the jargon of experimental psychologists, by instantiating a program with specific parameters, we have an "experimental subject", whose behavior we can observe and analyze.

It is surprising to note how little this point of view is familiar in computer science in general, and in AI in particular. The classical know-how of the empirical sciences regarding experimental design, data analysis, and hypothesis testing is still almost ignored in our field. Presumably, the reason is that computer programs, as artificial objects, are erroneously
believed to be easily understandable by their designers. It is important to realize that this is not true. Although a tiny, well-written computer program can be completely understood by analyzing it statically as a formal object, this has little to do with the behavior of complex programs, especially in the area of nonsymbolic processing, which lacks strong theoretical bases. In this case, a great deal of experimental activity is necessary to understand the behavior of programs, which behave almost like natural systems.

A disciplined empirical approach appears to be of central importance in what we have called "behavior engineering", that is, in the principled development of robotic agents. Not only is the agent itself too complex to be completely understood by analytical methods, but the agent also interacts with an environment that often cannot be completely characterized a priori. As we shall argue in chapter 7, the situation calls for an experimental activity organized along clear methodological principles.

The work reported in this book is experimentally oriented: it consists of a reasonably large set of experiments performed on a set of experimental subjects, all of which are instantiations of the same software system, ALECSYS. However, we have to admit that many of our experiments were below the standards of the empirical sciences. For this, we have a candid, but true, explanation. When we started our work, we partly shared the common prejudice about programs, and we only partially appreciated the importance of disciplined experimentation. But we learned along the way, as we hope some of the experiments will show.
1.7 POINTS TO REMEMBER
- Shaping a robot means exploiting the robot's learning capacities to translate suggestions from an external trainer into an effective control strategy that allows the robot to perform predefined tasks.
- As opposed to most work in reinforcement learning, our work stresses the importance of the trainer in making the learning process efficient enough to be used with real robots.
- The learning agent, the sensorimotor capacities of the robot, the environment in which the robot acts, and the trainer are strongly intertwined. Any successful approach to building learning autonomous robots will have to consider the relations among these components.
- Reinforcement learning techniques can be used to implement robot controllers with interesting properties. In particular, such techniques make it
possible to adopt architectures based on "soft computing" methods, like neural networks and learning classifier systems, which cannot be effectively programmed by hand.
- The approach to designing and building autonomous robots advocated in this book is a combination of design and learning. The role of design is to determine the starting architecture and the input-output interfaces of the control system. The rules that will constitute the control system are then learned by a reinforcement learning algorithm.
- Behavior engineering, that is, the discipline of principled robot development, will have to rely heavily on an empirical approach.
Chapter 2
ALECSYS
2.1 THE LEARNING CLASSIFIER SYSTEM PARADIGM

Learning classifier systems are a class of machine learning models which were first proposed by Holland (Holland and Reitman 1978; Holland 1986; Holland et al. 1986; Booker, Goldberg, and Holland 1989). A learning classifier system is a kind of parallel production rule system in which two kinds of learning take place. First, reinforcement learning is used to change the strength of rules. In practice, an apportionment of credit algorithm distributes positive reinforcements received by the learning system to those rules that contributed to the attainment of goals, and negative reinforcements to those rules that caused punished actions. Second, evolutionary learning is used to search in the rule space. A genetic algorithm searches for possibly new useful rules by genetic operations like reproduction, selection, crossover, and mutation. The fitness function used to direct the search is the rule strength.

As proposed by Holland, the learning classifier system or LCS is a general model; a number of low-level implementation decisions must be made to go from the generic model to the running software. In this section we describe LCSo, our implementation of Holland's learning classifier system, which unfortunately exhibits a number of problems: rule strength oscillation, difficulty in regulating the interplay between the reinforcement system and the background genetic algorithm (GA), rule chain instability, and slow convergence. We identified two approaches to extend LCSo with the goal of overcoming these problems: (1) increase the power of a single LCSo by adding new operators and techniques, and (2) use many LCSos in parallel. The first extension, presented in section 2.2, gave rise to the "improved classifier system" or ICS (Dorigo 1993), a more powerful version of LCSo that still retains its spirit. The second extension,
discussed in section 2.3, led to the realization of ALECSYS (Dorigo and Sirtori 1991; Dorigo 1992, 1995), a distributed version of ICS that allows for both low-level parallelism, involving distribution of the ICS over a set of transputers, and high-level parallelism, that is, the definition of many concurrent cooperating ICSs.

To introduce learning classifier systems, we consider the problem of a simple autonomous robot that must learn to follow a light source (this problem will be considered in detail in chapter 4). The robot interacts with the environment by means of light sensors and motor actions. The LCS controls the robot by using a set of production rules, or classifiers, which represent the knowledge the robot has of its environment and its task. The effect of control actions is evaluated by scalar reinforcement. Learning mechanisms are in charge of distributing reinforcements to the classifiers responsible for performed actions, and of discovering new useful classifiers.

An LCS comprises three functional modules that implement the controller and the two learning mechanisms informally described above (see figure 2.1):
Figure 2.1 Learning classifier system: the performance system exchanges perceptions and actions with the environment; the apportionment of credit system receives reinforcements and updates classifier strengths; the rule discovery system uses good classifiers to create new ones.
1. The performance system governs the interaction of the robot with the external environment. It is a kind of parallel production system, implementing a behavioral pattern, like light following, as a set of condition-action rules, or classifiers.
2. The apportionment of credit system is in charge of distributing reinforcements to the rules composing the knowledge base. The algorithm used is the "bucket brigade" (Holland 1980; in previous LCS research, and in most of reinforcement learning research, reinforcements are provided to the learning system whenever it enters certain states; in our approach they are provided by a trainer, or reinforcement program, as discussed in section 1.2).
3. The rule discovery system creates new classifiers by means of a genetic algorithm.
In LCSs, learning takes place at two distinct levels. First, the apportionment of credit system learns from experience the adaptive value of a number of given classifiers with respect to a predefined target behavior. Each classifier maintains a value, called "strength", and this value is modified by the bucket brigade in an attempt to redistribute rewards to useful classifiers and punishments to useless (or harmful) ones. Strength is therefore used to assess the degree of usefulness of classifiers. Classifiers that have all conditions satisfied are fired with a probability that is a function of their strength. Second, the rule discovery mechanism, that is, the genetic algorithm, allows the agent to explore the value of new classifiers. The genetic algorithm explores the classifiers' space, recombining classifiers with higher strength to produce possibly better offspring. Offspring are then evaluated by the bucket brigade.

It must be noted that the LCS is a general model with many degrees of freedom. When implementing an LCS, it is therefore necessary to make a number of decisions that transform the generic LCS into an implemented LCS. In the rest of this section we present the three modules of LCSo, our implementation of an LCS. ICS and ALECSYS, presented in sections 2.2 and 2.3, were obtained as improvements of LCSo.
2.1.1 LCSo: The Performance System

The LCSo performance system (see figure 2.2 and terminology box 2.1) consists of

- the classifier set CS;
- an input interface and an output interface with the environment (sensors and effectors) to receive/send messages from/to the environment;
Example box 2.1: an example of the matching process, in which the # symbol in a classifier condition acts as a "don't care" matching both 0 and 1; a classifier whose conditions are matched by messages on the message list may then post its own message or send it to the effectors.

Figure 2.2 LCSo performance system
Terminology box 2.1: definitions of the notation used in this chapter, including messages, classifiers and their condition and action parts (strings over {0, 1, #}), the message list ML, the match set MS, and classifier strength.
. the messagelist ML , composedof the three sublists: MLI , which collects messagessent from classifiersat the precedingtime step, MLENV , which collects messagescoming from the environment through the sensors , and MLEFF , which contains messagesusedto choosethe actions at the precedingtime step; . the match set MS, that is, the multiset containing classifiersthat have both conditions matched by at least one message(seeexample box 2.1) . MS is composedof MS-eff and MS-int , that is two multisets containing classifiersin MS that want to sendtheir messagesto effectorsand to the internal messagelist MLI , respectively; . a conflict resolutionmodule, to arbitrate conflicts among rules in MS-eff that proposecontradictory actions; and . an auction module, to select which rules in MS-int are allowed to appenda messageto MLI . In simplified terms, the LCSo performance systemworks like this. At time zero, a set of classifiersis randomly created, and both the message
list and the match set are empty. Then the following loop is executed. Environmental messages, that is, perceptions coming from sensors, are appended to the message list MLENV and matched against the condition part of classifiers. Matching classifiers are added to the match set, and the message list is then emptied. If there is any message to effectors, then the corresponding actions are executed (if necessary, a conflict resolution module is called to choose among conflicting actions); messages used to select performed actions are appended to MLEFF. If the cardinality of MS-int does not exceed that of MLI, all messages are appended. Otherwise, the auction module is used to draw a k-sample from MS-int (where k is the cardinality of MLI); the messages of the classifiers in the k-sample are then appended to MLI. The match set is emptied and the loop is repeated. The need to run an auction is one reason an apportionment of credit algorithm is introduced. As it redistributes reinforcements to the rules that caused the performed actions, the algorithm changes their strengths. This allows the system to choose which rules to select in accordance with some measure of their usefulness. The rules' strength is also used for conflict resolution among rules that send messages to effectors proposing inconsistent actions (e.g., "Go right" and "Go left"). Algorithm box 2.1 gives a detailed formulation of the performance algorithm.
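As a rough illustration, the loop just described can be sketched in Python as follows. This is only a minimal sketch, not the original implementation; the Classifier layout, the attribute names, and the strength-biased sampling (with replacement, for simplicity) are our own assumptions.

import random
from dataclasses import dataclass

@dataclass
class Classifier:
    conditions: tuple    # two condition strings over {0, 1, #}
    action: str          # message/action part, a bit string
    strength: float
    to_effectors: bool   # True if the message is addressed to the effectors

def matches(condition, message):
    # A condition matches a message if every position is equal or is '#'.
    return all(c == m or c == '#' for c, m in zip(condition, message))

def performance_cycle(classifiers, env_messages, internal_messages, mli_size):
    # Build the message list from environmental and internal messages.
    message_list = list(env_messages) + list(internal_messages)
    # Match set: classifiers whose conditions are each matched by at least one message.
    match_set = [c for c in classifiers
                 if all(any(matches(cond, m) for m in message_list)
                        for cond in c.conditions)]
    ms_eff = [c for c in match_set if c.to_effectors]
    ms_int = [c for c in match_set if not c.to_effectors]
    # Conflict resolution among effector classifiers: strength-biased choice.
    action = None
    if ms_eff:
        winner = random.choices(ms_eff, weights=[c.strength for c in ms_eff])[0]
        action = winner.action
    # Auction: post at most mli_size internal messages, biased by strength.
    if len(ms_int) <= mli_size:
        new_internal = [c.action for c in ms_int]
    else:
        winners = random.choices(ms_int, weights=[c.strength for c in ms_int], k=mli_size)
        new_internal = [c.action for c in winners]
    return action, new_internal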
2.1.2 LCSo: The Apportionment of Credit System
We have said that the main task of the apportionment of credit algorithm is to classify rules in accordance with their usefulness. In LCSo, the algorithm works as follows: to every classifier C a real value called "strength," Str(C), is assigned. At the beginning, each classifier has the same strength. When an effector classifier causes an action on the environment, a reinforcement may be generated, which is then transmitted backward to the internal classifiers that caused the effector classifier to fire. The backward transmission mechanism, examined in detail later, causes the strength of classifiers to change in time and to reflect the relevance of each classifier to the system performance (with respect to the system goal). Clearly, it is not possible to keep track of all the paths of activation actually followed by the rule chains (a rule chain is a set of rules activated in sequence, starting with a rule activated by environmental messages and ending with a rule performing an action on the environment) because the number of these paths grows exponentially with the length of the path. It is then necessary to have an appropriate algorithm that solves the problem using only local information, in time and in space.
Algorithm Box 2.1 LCSo: The performance system
" Local in time" means that the information used at every computational step is coming only from a fixed recent temporal interval. " Local in space" meansthat changesin a classifier's strength are causedonly by classifiers directly linked to it (we say that classifiers Cl and C2 are " linked" if the messagepostedby Cl matchesa condition of C2). The classical algorithm used for this purpose in LCSo, as well as in most LCSs, is the bucket brigade algorithm ( Holland 1980). This algorithm models the classifiersystemas an economic society, in which every classifierpays an amount of its strength to get the privilege of appending a messageto the messagelist and receivesa payment by the classifiers activated becauseof the presencein the messagelist of the messageit 2 appendedduring the precedingtime step. Also, a classifiercan increase (decrease ) its strength wheneveran action it proposesis performed and ~ results in receiving a reward (punishment) . In this way, reinforcement flows backward from the environment (trainer) again to the environment (trainer) through a chain of classifiers. The net result is that classifiers participatingI in chainsthat causehighly rewardedactions tend to increase their strength. Also , our classifiers lose a small percentage of their " " strength at each cycle (what is called the life tax ) so that classifiersthat are never usedwill sooneror later be eliminated by the geneticalgorithm . In algorithm box 2.2, we report the apportionment of credit system. Becausethe bucket brigade algorithm is closely intertwined with the performan system, a detailed description of how it works is more easily given together with the description of the performance system. Bucket brigade instructions are those steps of the credit apportionment system not already presentin the performancesystem(they can be easily recognized becausewe maintained the samename for instructions in the two systems, and the bucket brigade instructions were just added in between the performancesysteminstructions) . One of the effectsof the apportionment of credit algorithm should be to contribute to the formation of default hierarchies, that is, setsof rules that . A rule categorizethe set of environmental statesinto equivalenceclasses in a default hierarchy is characterizedby its specificity. A maximally specific rule will be activated by a single type of message(see, for example, the set of four rules in the left part of figure 2.3) . A maximally general (default) rule is activated by any message(see, for example, the last rule in the right part of figure 2. 3) . Consider a default hierarchy composedof two rules, as reported in the right part of figure 2. 3. In a default hierarchy, whenevera set of rules have
Figure 2.3 Nonhierarchical (homomorphic) set and hierarchical (quasi-homomorphic) set: the hierarchical set implements the same state categorization with fewer rules.
all conditions matched by at least one message, it is the most specific one that is allowed to fire. Default hierarchies have some nice properties. They can be used to build quasi-homomorphic models. Consider the set of environmental states that the learning system has to learn to categorize. The system could either try to find a set of rules that partition this whole set, never making mistakes (left part of figure 2.3), or build a default hierarchy (right part of figure 2.3). With the first approach, a homomorphic model of the environment is built; with the second, a quasi-homomorphic model, which generally requires far fewer rules than an equivalent homomorphic one (Holland 1975; Riolo 1987, 1989). Moreover, after an initial quasi-homomorphic model is built, its performance can be improved by adding more specific rules. In this way, the whole system can learn gracefully; that is, the performance of the system is not too strongly influenced by the insertion of new rules.
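Before moving to the rule discovery system, the bucket brigade update described earlier in this section can be sketched as follows. The parameter values (bid ratio, life tax) and the single-chain formulation are illustrative assumptions, not the values or the exact bookkeeping used in LCSo; classifiers are assumed to expose a strength attribute as in the earlier sketch.

def bucket_brigade_step(chain, reward, bid_ratio=0.1):
    # `chain` is the ordered list of classifiers that fired, from the one
    # matched by environmental messages to the one that acted on the environment.
    for i, clf in enumerate(chain):
        bid = bid_ratio * clf.strength
        clf.strength -= bid                  # pay for the right to post a message
        if i > 0:
            chain[i - 1].strength += bid     # payment to the supplier of the matched message
    chain[-1].strength += reward             # reward or punishment from the trainer

def apply_life_tax(classifiers, life_tax=0.005):
    # Classifiers that are never used slowly lose strength and will
    # eventually be eliminated by the genetic algorithm.
    for clf in classifiers:
        clf.strength -= life_tax * clf.strength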
2.1.3 LCSo: The Rule Discovery System
The rule discovery algorithm used in LCSo, as well as in most LCSs, is the genetic algorithm (GA). GAs are a computational device inspired by population genetics.
Terminology Box 2.2
A population is a multiset of individuals. An individual is a string of k positions, or genes. Typically, when GAs are applied to optimization problems, an individual (a chromosome) is a feasible solution; on the other hand, when GAs are applied in the LCS framework, an individual is a classifier. A gene can assume a value belonging to an allelic alphabet. The alphabet usually used is {0, 1}; the reasons underlying this choice are to be found in the way the genetic algorithm processes the information contained in the individuals: it has been shown (Holland 1975) that the lower the cardinality of the alphabet used, the higher the efficiency of the genetic algorithm in processing the structure of chromosomes. In LCSo, where the individuals are classifiers, the allelic values of the condition (message) genes belong to the alphabet {0, 1} augmented with the "don't care" symbol (#), which allows general (default) rules to be represented, while the allelic values of the action genes belong to {0, 1}.
We give here only some general aspects of their working; the interested reader can refer to one of the numerous books about the subject (for example, Goldberg 1989; Michalewicz 1992; Mitchell 1996). A GA is a stochastic algorithm that works by modifying a population of solutions to a given problem (see terminology box 2.2). Solutions are coded, often using a binary alphabet, although other choices are possible, and a function, called a "fitness function," is defined to relate solutions to performance (that is, to a measure of the quality of the solution). At every cycle, a new population is created from the old one, giving higher probability to reproduce (that is, to be present again in the new population of solutions) to solutions with a fitness higher than average, where fitness is defined to be the numeric result of applying the fitness function to a solution. This new population is then modified by means of some genetic operators; in LCSo these are
. crossover, which recombines individuals (it takes two individuals, operates a cut at a randomly chosen point, and recombines the two in such a way that some of the genetic material of the first individual goes to the second one and vice versa; see figure 2.4); and
. mutation, which randomly changes some of the allelic values of the genes constituting an individual.
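A minimal Python sketch of one GA cycle is given below. The string representation over {0, 1, #}, the operator probabilities, and the function names are illustrative assumptions, not the LCSo implementation.

import random

def one_point_crossover(a, b):
    # Cut the two parents at a random point and exchange the tails.
    point = random.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(individual, alphabet="01#", p_m=0.01):
    # Randomly change some of the allelic values.
    return "".join(random.choice(alphabet) if random.random() < p_m else g
                   for g in individual)

def next_generation(population, fitness, p_c=0.7, p_m=0.01):
    # Reproduction: fitness-proportional resampling of the population.
    weights = [fitness(ind) for ind in population]
    new_pop = random.choices(population, weights=weights, k=len(population))
    # Crossover on consecutive pairs, with probability p_c.
    for i in range(0, len(new_pop) - 1, 2):
        if random.random() < p_c:
            new_pop[i], new_pop[i + 1] = one_point_crossover(new_pop[i], new_pop[i + 1])
    # Mutation.
    return [mutate(ind, p_m=p_m) for ind in new_pop]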
Figure 2.4 Example of crossover operator: (a) two individuals (parents) are mated and crossover is applied, to generate (b) two new individuals (children).

Algorithm Box 2.3 LCSo: The rule discovery system (genetic algorithm)
0. Generate an initial random population P.
repeat
1. Apply the Reproduction operator to P: P' := Reproduce(P).
2. Apply the Crossover operator (with probability Pc) to P': P'' := Crossover(P').
3. Apply the Mutation operator (with probability Pm) to P'': P''' := Mutation(P'').
4. P := P'''.
until EndTest = true.
After thesemodifications, the reproduction operator is applied again, and the cycle repeatsuntil a termination condition is verified. Usually the termination condition is given by a maximum number of cycles, or by the lack of improvementsduring a fixed number of cycles. In algorithm box 2.3, we report the rule discoverysystem(geneticalgorithm) . The overall effect of the GAs' work is to direct the searchtoward areas of the solution spacewith higher valuesof the fitnessfunction. The computational speedupwe obtain using GAs with respectto random searchis due to the fact that the search is directed by the fitness function. This direction is based not on whole chromosomesbut on the parts that are strongly related to high values of the fitness function; these parts are called " building blocks" ( Holland 1975; Goldberg 1989). It has been proven ( Booker, Goldberg, and Holland 1989; Bertoni and Dorigo 1993)
that GAs process at each cycle a number of building blocks at least proportional to a polynomial function of the number of individuals in the population, where the degree of the polynomial depends on the population dimension. In LCSs, the hypothesis is that useful rules can be derived from recombination of other useful rules. GAs are then used as a rule discovery system. Most often, as in the case of our system, in order to preserve the system performance, they are applied only to a subset of rules: the m best rules (the rules with higher strength) are selected to form the initial population of the GA, while the worst m rules are replaced by those arising from application of the GA to the initial population. After applying the GA, only some of the rules are replaced. The new rules will be tested by the combined action of the performance and credit apportionment algorithms. Because testing a rule requires many time steps, GAs are applied with a much lower frequency than the performance and apportionment of credit systems. In order to improve efficiency, we added to LCSo a cover detector operator and a cover effector operator (these cover operators are standard in most LCSs). The cover detector fires whenever no classifier in the classifier set is matched by any message coming from the environment (that is, by messages in MLENV). A new classifier is created with conditions that match at least one of the environmental messages (a random number of #s are inserted in the conditions), and the action part is randomly generated. The cover effector, which is applied with a probability Pce, assures the same thing on the action side. Whenever some action cannot be activated by any classifier, then a classifier that generates the action and that has random conditions is created. In our experiments, we set Pce = 0.1. Covering is not an absolutely necessary operator because the system is in principle capable of generating the necessary classifiers by using the basic GA operators (crossover and mutation). On the other hand, because this could require a great deal of time, covering is a useful way to speed up search in particular situations easily detected by the learning system.
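The cover detector can be sketched as follows; the matching test, the probability of inserting a # symbol, and the way the new rule is returned are illustrative assumptions only.

import random

def matches(condition, message):
    return all(c == m or c == '#' for c, m in zip(condition, message))

def cover_detector(classifiers, env_messages, p_hash=0.33):
    # If some classifier is matched by the environmental messages, do nothing.
    if any(all(any(matches(cond, m) for m in env_messages) for cond in c.conditions)
           for c in classifiers):
        return None
    # Otherwise build a condition matching one environmental message, with a
    # random number of positions generalized to '#', and a random action part.
    target = random.choice(list(env_messages))
    condition = "".join('#' if random.random() < p_hash else bit for bit in target)
    action = "".join(random.choice("01") for _ in range(len(target)))
    return condition, action   # to be wrapped into a new classifier and added to CS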
2.2 ICS: IMPROVED CLASSIFIER SYSTEM
In this section, we describe ICS, an improved version of LCSo. The following major innovations have been introduced (for the experimental evaluation of these changes to LCSo, see section 4.2).
2.2.1 ICS: Calling the Genetic Algorithm When a Steady State Is Reached
In LCSo, as in most classic implementations of LCSs, the genetic algorithm (i.e., reproduction, crossover, and mutation) is called every q cycles, q being a constant whose optimal value is experimentally determined. A drawback of this approach is that experiments to find q are necessary, and that, even if using the optimal value of q, the genetic algorithm will not be called, in general, at the exact moment when rule strength accurately reflects rule utility. This happens because the optimal value of q changes in time, depending on the dynamics of the environment and on the stochastic processes embedded in the learning algorithms. If the genetic algorithm is called too soon, it uses inaccurate information; if it is called long after a steady state has been reached, there is a waste of computing time (from the GA point of view, the time spent after the steady state has been reached is useless). A better solution is to call the genetic algorithm when the bucket brigade has reached a steady state; in this way, we can reasonably expect that when the genetic algorithm is called, the strength of every classifier reflects its actual usefulness to the system. The problem is how to correctly evaluate the attainment of a steady state. We have introduced a function E_LCS(t), called "energy" of the LCS at time t, defined as the sum of the strengths of all classifiers. An LCS is said to be at a steady state at time t when
E_LCS(t') ∈ [E_min, E_max]   for all t' ∈ [t − k, t),

where

E_min = min { E_LCS(t'), t' ∈ [t − 2k, t − k] }, and
E_max = max { E_LCS(t'), t' ∈ [t − 2k, t − k] },
k being a parameter. This excludes cases in which E_LCS(t) is increasing or decreasing, and those in which E_LCS(t) is still oscillating too much. Experiments have shown that the value of k is very robust; in our experiments (see section 4.2; and Dorigo 1993), k was set to 50, but no substantial differences were found for values of k in the range between 20 and 100.
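A direct transcription of this test into Python might look as follows; the function and variable names are ours, and the energy is assumed to be recorded once per cycle.

def at_steady_state(energy_history, t, k=50):
    # energy_history[i] is E_LCS at cycle i (the sum of all strengths);
    # k = 50 as in the experiments reported in the text.
    if t < 2 * k:
        return False
    reference = energy_history[t - 2 * k : t - k]
    recent = energy_history[t - k : t]
    e_min, e_max = min(reference), max(reference)
    # Steady if every recent energy value stays inside the band spanned by
    # the values observed in the previous window of k cycles.
    return all(e_min <= e <= e_max for e in recent)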
2.2.2 ICS: The Mutespec Operator
Mutespec is a new operator we introduced to reduce the variance in the reinforcement received by default rules.

Figure 2.5 Example of an oscillating classifier
Figure 2.6 Example of application of the mutespec operator

A problem caused by the presence of "don't care" (#) symbols in classifier conditions is that the same classifier can receive high rewards when matched by some messages, and low rewards (or even punishments if we use negative rewards) when matched by others. We call these classifiers "oscillating classifiers." Consider the example in figure 2.5: the oscillating classifier has the "don't care" symbol in the third position of the first condition. Suppose that whenever the classifier is matched by a message with a 1 in the position corresponding to the # in the classifier condition, the message 1111 is useful, and that whenever the matching value is a 0, the message 1111 is harmful. As a result, the strength of that classifier cannot converge to a steady state but will oscillate between the values that would be reached by the two more specific classifiers in the bottom part of figure 2.6. The major problem with oscillating classifiers is that, on average, they will be activated too often when they should not be used and too seldom when they could be useful. This causes the performance of the system to be lower than it could be. The mutespec operator tries to solve this problem using the oscillating classifier as the parent of two offspring classifiers, one of which will have a
0 in place of the #, and the other, a 1 (see figure 2.6). This is the optimal solution when there is a single #. When the number of # symbols is greater than 1, the # symbol chosen could be the wrong one (i.e., not the one responsible for oscillations). In that case, because the mutespec operator did not solve the problem, it is highly probable the operator will be applied again. The mutespec operator can be likened to the mutation operator. The main differences are that mutespec is applied only to oscillating classifiers and that it always mutates # symbols to 0s and 1s. And while the mutation operator mutates a classifier, mutespec introduces two new, more specific, classifiers (the parent classifier remains in the population). To decide which classifier should be mutated by the mutespec operator, we monitor the variance of the rewards each classifier gets. This variance is computed according to the following formula:
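(The displayed equation is not legible in this reproduction; a sample variance of the rewards the classifier received over its last n activations, of the form below, is consistent with the surrounding definitions.)

var(R_C) = (1/n) Σ_i ( R_C(t_i) − R̄_C )²,   with R̄_C = (1/n) Σ_i R_C(t_i),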
where R_C(t) is the reward received by classifier C at time t. A classifier C is an oscillating classifier if var(R_C) ≥ r · var̄, where var̄ is the average variance of the classifiers in the population and r is a user-defined parameter (we experimentally found that a good value is r = 1.25). At every cycle, mutespec is applied to the oscillating classifier with the highest variance. (If no classifier is an oscillating classifier, then mutespec is not applied.)
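The following sketch shows the selection-and-split logic of mutespec; the classifier attributes, the reward bookkeeping, and the copy_with_bit helper are hypothetical names, not the original code.

import random

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def mutespec(population, reward_history, r=1.25):
    variances = {c: variance(reward_history[c]) for c in population if reward_history[c]}
    if not variances:
        return
    avg_var = sum(variances.values()) / len(variances)
    oscillating = [c for c, v in variances.items()
                   if v >= r * avg_var and any('#' in cond for cond in c.conditions)]
    if not oscillating:
        return                   # no oscillating classifier: mutespec is not applied
    parent = max(oscillating, key=lambda c: variances[c])
    # Pick one '#' position; if it is not the one causing the oscillation,
    # mutespec will probably be applied to this classifier again later.
    positions = [(i, j) for i, cond in enumerate(parent.conditions)
                 for j, g in enumerate(cond) if g == '#']
    i, j = random.choice(positions)
    for bit in "01":
        # copy_with_bit (hypothetical) returns a copy of the parent with the
        # chosen '#' replaced by the given bit; the parent stays in the population.
        population.append(copy_with_bit(parent, i, j, bit))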
2.2.3 ICS: Dynamically Changing the Number of Classifiers Used
In ICS, the number of used rules dynamically changes at run time (i.e., the classifier set's cardinality shrinks as the bucket brigade finds out that some rules are useless or dangerous). The rationale for reducing the number of used rules is that when a rule's strength drops to very low values, its probability of winning the competition is also very low; should it win the competition, it is very likely to propose a useless action. As the time spent matching rules against messages in the message list is proportional to the classifier set's cardinality, cutting down the number of matching rules results in the ICS performing a greater number of cycles (a cycle goes from one sensing action to the next) than LCSo in the same time period. This causes a quicker convergence to a steady state; the genetic
algorithm can be called with higher frequency and therefore a greater number of rules can be tested. In our experiments, the use of a classifier is inhibited when its strength becomes lower than h · S̄(t), where S̄(t) is the average strength of the classifiers in the population at time t, and h is a user-defined parameter (0.25 ≤ h ≤ 0.35 was experimentally found to be a good range for h).
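A one-function sketch of this inhibition rule, with an illustrative value of h inside the range just quoted:

def active_classifiers(classifiers, h=0.30):
    # Only classifiers whose strength is at least h times the current average
    # strength take part in matching.
    mean_strength = sum(c.strength for c in classifiers) / len(classifiers)
    return [c for c in classifiers if c.strength >= h * mean_strength]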
2.3 THE ALECSYS SYSTEM
ALECSYS is a tool for experimenting with parallel ICSs. One of the main problems faced by ICSs trying to solve real problems is the presence of heavy limitations on the number of rules that can be employed, due to the linear increase in the basic cycle complexity with the classifier set's cardinality. One possible way to increase the amount of processed information without slowing down the basic elaboration cycle would be the use of parallel architectures, such as the Connection Machine (Hillis 1985) or the transputer (INMOS 1989). A parallel implementation of an LCS on the Connection Machine, proposed by Robertson (1987), has demonstrated the power of such a solution, while still retaining, in our opinion, a basic limit. Because the Connection Machine is a single-instruction-multiple-data architecture (SIMD; Flynn 1972), the most natural design for the parallel version of an LCS is based on the "data parallel" modality, that is, a single flow of control applied to many data. Therefore, the resulting implementation is a more powerful, though still classic, learning classifier system. Because our main goal was to give our system features such as modularity, flexibility, and scalability, we implemented ALECSYS on a transputer system. The transputer's multiple-instruction-multiple-data architecture (MIMD) permits many simultaneously active flows of control to operate on different data sets and the learning system to grow gradually without major problems. We organized ALECSYS in such a way as to have both SIMD-like and MIMD-like forms of parallelism concurrently working in the system. The first was called "low-level parallelism" and the second, "high-level parallelism." Low-level parallelism operates within the structure of a single ICS and its role is to increase the speed of the ICS. High-level parallelism allows various ICSs to work together; therefore, the complete learning task can be decomposed into simpler learning tasks running in parallel.
2.3.1 Low-Level Parallelism: A Solution to SpeedProblems We turn now to the microstructureof A LECSYS , that is, the way parallelism was used to enhancethe performance of the ICS model. ICS, like LCSo and most LCSs, can be depicted as a set of three interacting systems (see figure 2. 1) : ( 1) the performance system; (2) the credit apportionment system(bucketbrigade algorithm); and (3) the rule discoverysystem (geneticalgorithm). In order to simplify the presentationof the low-level parallelization algorithms, the various ICS systemsare discussedseparately . We show how first the performance and credit apportionment systems and then the rule discovery system (genetic algorithm) were parallelized. 2.3.1.1 The Performanceand Apportionmentof Credit Systems As we have seenin section 2.1, the basic execution cycle of a sequential learning classifier systemlike LCSo can be looked at as the result of the interaction betweentwo data structures: the list of messages , ML , and the set of classifiers, CS. Therefore, we decomposea basic execution cycle into two co~current processes, M Lprocess and CSprocess . M Lprocess communicateswith the input processSEprocess(SEnsor), and the output process, EFprocess( EFfector), as shown in figure 2.7. The processes communicate by explicit synchronization, following an algorithm whosehigh-level description is given in algorithm box 2.4. Becausesteps3 and 4 (matching and messageproduction) can be executed on eachclassifierindependently, we split CSprocessinto an array of concurrent subprocesses { CSprocessl, . . . , CSprocess ;, . . . , CSprocessn }, each taking care of Iln of CS. The higher n goes, the more intensive the concurrency is. In our transputer-based implementation, we allocated about 100- 500 rules to each processor (this range was experimentally determinedto be the most efficient; seeDorigo 1992). CSprocessescan be organized in hierarchical structures, such as a tree (seefigure 2.8) or a toroidal grid (the structure actually chosendeeply influencesthe distribution of computational loads and, therefore, the computational efficiency of the overall system; seeCamilli et ale 1990). Many other stepscan be parallelized. We propagatethe messagelist ML from M Lprocess to the i -th CSprocessand back, obtaining concurrent processingof credit assignmenton each CSprocess ;. ( The auction among triggered rules is subjectto a hierarchical distribution mechanism, and the samehierarchical approach is applied to reinforcementdistribution.)
Figure 2.7 Concurrent processes in ICS
Algorithm Box 2.4 Process-oriented view of classifier learning systems
while not stopped do
1. MLprocess receives messages from SEprocess and places them in the message list ML.
2. MLprocess sends ML to CSprocess.
3. CSprocess matches ML against CS, calculating bids for each triggered rule.
4. CSprocess sends MLprocess the list of triggered rules.
5. MLprocess erases the old message list and makes an auction among triggered rules; the winners, selected with respect to their bid, are allowed to post their own messages, thus composing a new message list ML.
6. MLprocess sends ML to EFprocess.
7. EFprocess chooses the action to apply and, if necessary, discards conflicting messages from ML; EFprocess receives reinforcements and is then able to calculate the reinforcement owed to each message in ML; this list of reinforcements is sent back to MLprocess, together with the remaining ML.
8. MLprocess sends the set of messages and reinforcements to CSprocess.
9. CSprocess modifies the strengths of CS elements, paying bids, assigning reinforcements, and collecting taxes.
endwhile
Figure 2.8 Parallel version of the ICS. In this example, the CSprocess of figure 2.7 is split into six concurrent processes: CSprocess1, ..., CSprocess6.
2.3.1.2 The Rule Discovery System
We now illustrate briefly our parallel implementation of the rule discovery system, that is, the genetic algorithm or GA. A first process, GAprocess, can be assigned the duty to select from among CS elements those individuals that are to be either replicated or discarded. It will be up to the (split) CSprocesses, after receiving GAprocess decisions, to apply genetic operators, each single CSprocess focusing upon its own fraction of the CS population. Because it could affect CS strengths, upon which genetic selection is based, MLprocess stays idle during GA operations. Likewise, GAprocess is "dormant" when MLprocess works. Our parallel version of the genetic algorithm is reported in algorithm box 2.5. Considering our parallel implementation, step 1 may be seen as an auction, which can be distributed over the processor network by a
"hierarchical gathering and broadcasting" mechanism, similar to the one we used to propagate the message list. Step 2 (mating of rules) is not easy to parallelize because it requires a central management unit. Luckily, in LCS applications of the GA, the number of pairs is usually low and concurrency seems to be unnecessary (at least for moderate-sized populations). Step 3 is the hardest to parallelize because of the amount of communication it requires between MLprocess and the array of split CSprocesses. Step 4 is a typical example of local data processing, extremely well suited to concurrent distribution.

Algorithm Box 2.5 Parallel ALECSYS genetic algorithm
1. Each CSprocess selects, within its own subset of classifiers, m rules to replicate and m to replace (m is a system parameter).
2. Each CSprocess sends GAprocess some data about each selected classifier, enabling GAprocess to set up a hierarchical auction based on strength values; this results in GAprocess selecting 2m individuals within the overall CS population.
3. GAprocess sends to each CSprocess containing a parent the following data: identifier of the parent classifier itself, identifiers of the two offspring, crossover point. At the same time, GAprocess sends to each CSprocess containing an offspring position the following data: for any offspring, identifier of the offspring, identifiers of the two parents, crossover point.
4. All the CSprocesses that have parents in their own fraction of CS send a copy of the parent rule to the CSprocess that has the corresponding offspring position; this process will apply the crossover and mutation operators, and the offspring will overwrite the rules to be replaced with newly generated rules.
2.3.2 High-Level Parallelism: A Solution to Behavioral Complexity Problems
In section 2.3.1 we presented a method for parallelizing a single ICS, with the goal of obtaining improvements in computing speed. Unfortunately, this approach shows its weakness when an ICS is applied to problems involving multiple tasks, which seems to be the case for most real problems. We propose to assign the execution of the various tasks to different ICSs.
Figure 2.9 Example of concurrent high-level and low-level parallelism. Using ALECSYS, the learning system designer divides the problem into three ICSs (high-level parallelization) and maps each ICS onto a subnet of nodes (low-level parallelization).

This approach requires the introduction of a stronger form of parallelism, which we call "high-level parallelism." Moreover, scalability problems arise in a low-level parallelized ICS. Adding a node to the transputer network implementing the ICS makes the communication load grow faster than computational power, which results in a less than linear speedup. Adding processors to an existing network is thus decreasingly effective. As already said, a better way to deal with complex problems would be to code them as a set of easier subproblems, each one allocated to an ICS. In ALECSYS the processor network is partitioned into subsets, each having its own size and topology. To each subset is allocated a single ICS, which can be parallelized at a low level (see figure 2.9). Each of these ICSs learns to solve a specific subgoal, according to the inputs it receives. Because ALECSYS is not provided with any automated way to come up with an optimal or even a good decomposition of a task into subtasks, this work is left to the learning system designer, who should try to identify independent basic tasks and coordination tasks and then assign them to different ICSs. The design approach to task decomposition is common to most of the current research on autonomous systems (see, for example, Mahadevan and Connell 1992; and Lin 1993a, 1993b), and will be discussed in chapter 3.
2.4 POINTS TO REMEMBER
. The learning classifier system (LCS) is composed of three modules: (1) the performance system; (2) the credit apportionment system; and (3) the rule discovery system.
. The LCS can be seen as a way of automatically synthesizing a rule-based controller for an autonomous agent. The synthesized controller consists of the performance system alone.
. The learning components in an LCS are the credit apportionment system and the rule discovery system. The first evaluates the usefulness of rules, while the second discovers potentially useful rules.
. ICS (improved classifier system) is an improved version of LCSo, a particular instantiation of an LCS. ICS's major improvements are (1) the GA is called only when the bucket brigade has reached a steady state; (2) a new genetic operator called "mutespec" is introduced to overcome the negative effects of oscillating classifiers; and (3) the number of rules used by the performance system is dynamically reduced at runtime.
. ALECSYS is a tool that allows for the distribution of ICSs on coarse-grained parallel computers (transputers in the current implementation). By means of ALECSYS, a designer can take advantage of both parallelization of a single ICS and cooperation among many ICSs.
Chapter 3
Architectures and Shaping Policies
3.1 THE STRUCTURE OF BEHAVIOR
As we have noted in chapter 1, behavior is the interaction of the agent with its environment. Given that our agents are internally structured as an LCS and that LCSs operate in cycles, we might view behavior as a chain of elementary interactions between the agent and the environment, each of which is caused by an elementary action performed by the agent. This is, however, an oversimplified view of behavior. Behavior is not just a flat chain of actions, but has a structure, which manifests itself in basically two ways:
I . The chain of elementary actions can be segmentedin a sequenceof different segments , each of which is a chain of actions aimed at the execution of a specifictask. For example, an agent might perform a number of actions in order to reach an object, then a number of actions to grasp the object, and finally a chain of actions to carry the object to a container. 2. A single action can be made to perform two or more tasks simultaneously . For example, an agent might make an action in a direction that allows it both to approach a goal destination and to keep away from an obstacle. Following well-establishedusage, we shall refer to the actions aimed at " accomplishinga specifictask as a behavior" ; we shall therefore speakof activities such as approaching an object, avoiding obstacles, and the like as " specificbehaviors." Some behaviors will be treated as basic in the sensethat they are not structured into simpler ones. Other behaviors, seen to be made up of simpler ones, will be regarded as complex behaviors, whose component behaviorscan be either basic or complex.
Complex behaviors can be obtained from simpler behaviors through a number of different composition mechanisms. In our work, we consider the following mechanisms:
. Independent sum: Two or more behaviors involving independent effectors are performed at the same time; for example, an agent may assume a mimetic color while chasing a prey. The independent sum of behaviors α and β will be written as α | β.
. Combination: Two or more behaviors involving the same effectors are combined into a resulting one; for example, the movement of an agent following a prey and trying to avoid an obstacle at the same time. The combination of behaviors α and β will be written as α + β.
. Suppression: A behavior suppresses a competing one; for example, the agent may give up chasing a prey in order to escape from a predator. When behavior α suppresses behavior β, we write
. Sequence: A behavior is built as a sequence of simpler ones; for example, fetching an object involves reaching the object, grasping it, and coming back. The sequence of behavior α and behavior β will be written as α · β; moreover, we shall write "σ*" when a sequence σ is repeated for an arbitrary number of times.
In general, several mechanisms can be at work at the same time. For example, an agent could try to avoid fixed hurting objects while chasing a moving prey and being ready to escape if a predator is perceived.
With respect to behavior composition, the approach we have adopted so far is the following. First, the agent designer specifies complex behaviors in terms of their component simpler behaviors. Second, the designer defines an architecture such that each behavior is mapped onto a single behavioral module, that is, a single LCS of ALECSYS. Third, the agent is trained according to a suitable shaping policy, that is, a strategy for making the agent learn the complex behavior in terms of its components. In the next sections, we review the types of architectures we have used in our experiments.
3.2 TYPES OF ARCHITECTURES
By using ALECSYS, the control system of an agent can be implemented by a network of different LCSs. It is therefore natural to consider the issue of architecture, by which we mean the problem of designing the network that best fits some predefined class of behaviors. As we shall see, there are interesting connections between the structure of behavior, the architecture of controllers, and shaping policies (i.e., the ways in which agents are trained). In fact, the structure of behavior determines a "natural" architecture, which in turn constrains the shaping policy. Within ALECSYS, realizable architectures can be broadly organized in two classes: monolithic architectures, which are built by one LCS directly connected to the agent's sensors and effectors; and distributed architectures, which include several interacting LCSs. Distributed architectures, in turn, can be more or less "deep"; that is, they can be flat architectures, in which all LCSs are directly connected to the agent's sensors and effectors, or hierarchical architectures, built by a hierarchy of levels. In principle, any behavior that can be implemented by ALECSYS can be implemented through a monolithic architecture, although distributed architectures allow for the allocation of specific tasks to different behavioral modules, thus lowering the complexity of the learning task. At the present stage, architectures cannot be learned by ALECSYS, but have to be designed; in other words, an architecture serves to include a certain amount of a priori knowledge into our agents so that their learning effort is reduced. To reach this goal, however, the architecture must correctly match the structure of the global behavior to be learned.
3.2.1 MonolithicArchitectures The simplestchoiceis the monolithic architecture , with one LCS in chargeof controllingthe wholebehavior(figure3.1). As we havealready noted, this kind of architecturein principlecan implementany behavior of whichALECsysis capable , althoughfor nontrivial behaviorsthe complexity of learninga monolithiccontrollercanbe too high. Usinga monolithicarchitectureallowsthe designerto put into the system . Suchinitial knowledge aslittle initial domainknowledgeaspossible
Figure 3.1 Monolithic architectures
Figure 3.2 Flat architectures
is embeddedin the design of the sensorimotor interface, that is, in the structure of messagescoming from sensorsand going to effectors. If the target behavior is made up of severalbasic responses , and if such basicresponsesreact only to part of the input information , the complexity of learning can be reduced even by using a monolithic architecture. Instead of wrapping up the state of all sensorsin a single message(figure 3.la ), one can distribute sensorinput into a set of independentmessages " (figure 3.lb ) . We call the latter case monolithic architecture with distributed " input . The idea is that inputs relevant to different responsescan ; in such away , input messagesare shorter, the go into distinct messages search space of different classifiersis smaller, and the overall learning effort can be reduced(seethe experimenton monolithic architecturewith distributed input in section 4.2) . The only overhead is that each input messagehas to be tagged so that the sensor it comes from can be identified. 3.2.2 Flat Architectures A distributed architectureis made up of more than one LCS. If aU LCSs ' are directly connectedto the agent s sensors, then we use the term flat architecture(figure 3.2) . The idea is that distinct LCSs implement the dif -
ferent basic responses that make up a complex behavior pattern. There is a further issue here, regarding the way in which the agent's response is built up from the actions proposed by the distinct LCSs. If such actions are independent, they can be realized in parallel by different effectors (figure 3.2a); for example, by rotating a videocamera while moving straight on. Actions that are not independent, on the other hand, have to be integrated into a single response before they are realized (figure 3.2b). For example, the action suggested by an LCS in charge of following an object and the action suggested by an LCS in charge of avoiding obstacles have to be combined into a single physical movement. (One of the many different ways of doing this might be to regard the two actions as vectors and add them.) In the present version of ALECSYS, the combination of different actions in a flat architecture cannot be learned but has to be programmed in advance.

Figure 3.3 Two-level hierarchical architectures
3.2.3 Hierarchical Architectures
In a flat architecture, all input to LCSs comes from the sensors. In a hierarchical architecture, however, the set of all LCSs can be partitioned into a number of levels, with lower levels allowed to feed input into higher levels. By definition, an LCS belongs to level N if it receives input from systems of level N − 1 at most, where level 0 is defined as the level of sensors. An N-level hierarchical architecture is a hierarchy of LCSs having level N as the highest one; figure 3.3 shows two different two-level hierarchical architectures. In general, first-level LCSs implement basic behaviors, while higher-level LCSs implement coordination behaviors. With an LCS in a hierarchical architecture we have two problems: first, how to receive input from a lower-level LCS; second, what to do with the output.
Figure 3.4 Example of a three-level switch architecture for the Chase/Feed/Escape behavior. Besides the three basic behaviors, there are two switches, SW1 and SW2.

Receiving input from a lower-level LCS is easy: both input and output messages are bit strings of some fixed length, so that an output message produced by an LCS can be treated as an input message by a different LCS. The problem of deciding what to do with the output of LCSs is more complex. In general, the output messages from the lower levels go to higher-level LCSs, while the output messages from the higher levels can go directly to the effectors to produce the response (see figure 3.3a) or can be used to control the composition of responses proposed by lower LCSs (see figure 3.3b). Most of the experiments presented in this book were carried out using suppression as the composition rule; we call the resulting hierarchical systems "switch architectures." In figure 3.4, we show an example of a three-level switch architecture implementing an agent that has to learn a Chase/Feed/Escape behavior, that is, a complex behavior taking place in an environment containing a "sexual mate," pieces of food, and possibly a predator. This can be defined as
if there is a predator
   then Escape
   else if hungry
      then Feed {i.e., search for food}
      else Chase {i.e., try to reach the sexual mate}
   endif
endif
In this example, the coordinator of level two (SW1) should learn to suppress the Chase behavior whenever the Feed behavior proposes an action, while the coordinator of level three (SW2) should learn to suppress SW1 whenever the Escape behavior proposes an action.
3.3 REALIZING AN ARCHITECTURE
3.3.1 How to Design an Architecture: Qualitative Criteria
The most general criterion for choosing an architecture is to make the architecture naturally match the structure of the target behavior. This means that each basic response should be assigned an LCS, and that all such LCSs should be connected in the most natural way to obtain the global behavior. Suppose the agent is normally supposed to follow a light, while being ready to reach its nest if a specific noise is sensed (revealing the presence of a predator). This behavior pattern is made up of two basic responses, namely, following a light and reaching the nest, and the relationship between the two is one of suppression. In such a case, the switch architecture is a natural choice. In general, the four mechanisms for building complex behaviors defined in section 3.1 map onto different types of architecture as follows:
. Independent sum can be achieved through a flat architecture with independent outputs (figure 3.2a). This choice is the most natural because if two behaviors exploit different effectors, they can run asynchronously in parallel.
. Combination can be achieved either through a flat architecture with integrated outputs (figure 3.2b) or through a hierarchical architecture (figure 3.3). A flat architecture is advisable when the global behavior can be reduced to a number of basic behaviors that interact in a uniform way; for example, by producing a linear combination of their outputs. As we have already pointed out, the combination rule has to be explicitly programmed. On the other hand, if the agent has to learn the combination rule, the only possible choice with ALECSYS is to use a hierarchical architecture: two LCSs can send their outputs to a higher-level LCS in charge of combining the lower-level outputs into the final response. However, complex combination rules may be very difficult to learn.
. Suppression can be achieved through a special kind of hierarchical architecture we have called "switch architecture." Suppression is a simple limit
caseof combination: two lower-leveloutputs are " combined" by letting one of the two becomethe final response.In this case, the combination rule can be effectively learned by ALECsys, resulting in a switch architecture. . Sequencecan be achievedeither through a fiat or switch architecture or through a hierarchical architecturewith added internal states. As we shall seein chapter 6, there are actually two very different types of sequential behaviors. The first type, which we have called " pseudosequential behavior " , occurs when the perception of changesin the external environment is sufficient to correctly chain the relevant behaviors. For example, exploring a territory , locating food, reaching for it , and bringing it to a nest is a kind of pseudosequentialbehavior becauseeach behavior in the sequencecan be chosensolely on the basis of current sensoryinput . This type of behavior is not substantially different from nonsequential behavior and can be implemented by a fiat or switch architecture. Proper sequentialbehavior occurs when there is not enough information in the environment to chain the basic behaviors in the right way. For example, an agent might have to perform a well-definedsequenceof actions that leave no per' ceivabletrace on the environment, like moving back and forth betweentwo locations. This type of behavior requires somekind of " internal state" to keep track of the different phases the agent goes through. We have implemented proper sequential behavior by adding internal statesto agents, and by allowing an LCS to update such a state within a hierarchical architecture (seechapter 6).
3.3.2 How to Designan Architecture: Quantitative Criteria The main reasonfor introducing an architectureinto LCSs is to speedup the learning of complex behavior patterns. Speedupis the result of factoring a large searchspaceinto smaller ones, so that a distributed architecture will be useful only if the component LCSs have smaller search spacesthan a single LCS able to perform the sametask. We can turn this consideration into a quantitative criterion by observing that the size of a searchspacegrows exponentially with the length of . This implies that a hierarchical architecture can be useful only messages if the lower-level LCSs realize some kind of informational abstraction, thus transforming the input messagesinto shorter ones(see, for example, the experiment on the two-level switch architecture in section 4.4.3) . Consider an architecture in which a basic behavioral module receives from its sensorsfour -bit messagessaying where a light is located. If this basic behavioral module sendsthe upper level four -bit messagesindicat-
ing the proposed direction of motion, then the upper level can use the sensorial information directly, by-passing the basic module. Indeed, even if this basic behavioral module learns the correct input-output mapping, it does not operate any information abstraction, and because it sends the upper level the same number of bits it receives from its sensors, it makes the hierarchy computationally useless. We call this the "information compression" requirement. An LCS whose only role is to compress sensory messages can be viewed as a "virtual sensor," able to extract from physical sensory data a set of relevant features. As regards the actual sensors, we prefer to keep them as simple as possible, even in the simulated experiments: all informational abstraction, if any, takes place within ALECSYS and has to be learned. In this way, we make sure that the "intelligence" of our agents is located mainly within ALECSYS and that simulated behaviors can also be achieved by real robots endowed with low-cost physical sensors.
3.3.3 Implementing an Architecture with ALECSYS
With ALECSYS it is possible to define two classes of learning modules: basic behaviors and coordination behaviors. Both are implemented as learning classifier systems. Basic behaviors are directly interfaced with the environment. Each basic behavior receives bit strings as input from sensors and sends bit strings to effectors to propose actions. Basic behaviors can be inserted in a hierarchical architecture; in this case, they also send bit strings to connected higher-level coordination modules. Consider AutonoMouse II and the complex behavior Chase/Feed/Escape, which has already been mentioned in section 2.3 and will be extensively investigated in chapter 4. Figure 3.5 shows one possible innate architecture of an agent with such a complex behavior. A basic behavior has been designed for each of the three behavioral patterns used to describe the learning task. In order to coordinate basic behaviors in situations where two or more of them propose actions simultaneously, a coordination module is used. It receives a bit string from each connected basic behavior (in this case a one-bit string, the bit indicating whether the sending ICS wants to do something or not) and proposes a coordination action. This coordination action goes into the composition rule module, which implements the composition mechanism. In this example, the composition rule used is suppression, and therefore only one of the basic actions proposed is applied.
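The suppression composition just described can be sketched as follows; the data layout, the behavior names, and the way the coordinator's choice is encoded are illustrative assumptions, not the ALECSYS interface.

def compose_by_suppression(proposals, coordinator_choice):
    # proposals maps each basic ICS to a (wants_to_act, proposed_action) pair,
    # e.g. {"chase": (1, "move_NE"), "feed": (1, "move_S"), "escape": (0, None)};
    # coordinator_choice is the behavior selected by the coordinator ICS,
    # which is exactly what that ICS has to learn to produce.
    wants, action = proposals[coordinator_choice]
    return action if wants else None   # only one basic action is applied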
Figure 3.5 Example of innate architecture for the three-behavior learning task
Figure 3.6 (a) Example of input message; (b) example of output message; (c) example of the input-output interface for the ICS-Chase behavior
All behaviors (both basic and coordination) are implemented as ICSs. For example, the basic behavioral pattern Chase is implemented as an ICS, which for ease of reference we call "ICS-Chase" (figure 3.5). Figure 3.6 shows the input-output interface of ICS-Chase. In this case, the input pattern only says which sensors see the light. (AutonoMouse II has four binary sensors, each of which is set to 1 if it detects a light intensity higher than a given threshold, and to 0 otherwise.) The output pattern is composed of a proposed action, a direction of motion plus a move/do not move command, and a bit string (in this case of length 1) going to ICS-Coordinator (figure 3.5). The bit string is there to let the coordinator
know when ICS-Chase is proposing an action. In general, coordination behaviors receive input from lower-level behavioral modules and produce an output action that, with different modalities depending on the composition rule used, influences the execution of actions proposed by basic behaviors.
3.4 SHAPING POLICIES
The use of a distributed system, either flat or hierarchical, brings in the new problem of selecting a shaping policy, that is, the order in which the various tasks are to be learned. In our work, we identified two extreme choices:
. Holistic shaping, in which the whole learning system is considered as a black box. The actual behavior of a single learning classifier system is not used to evaluate how to distribute reinforcements; when the system receives a reinforcement, it is given to all of the LCSs within the system architecture. This procedure has the advantage of being independent of the internal architecture, but can make the learning task difficult because there can be ambiguous situations in which the trainer cannot give the correct reinforcement to the component LCSs. A correct action might be the result of two wrong messages. For example, in our Chase/Feed/Escape task (see figure 3.4), the LCS called SW1 might, in a feeding situation, choose to give control to the wrong basic LCS, that is, LCS-Chase, which in turn would propose a normally wrong action, "Move away from the light," that, by chance, happened to be correct. Nevertheless, this method of distributing reinforcements is interesting because it does not require access to the internal modules of the system and is much more plausible from an ethological point of view.
. Modular shaping, in which each LCS is trained with a different reinforcement program, taking into account the characteristics of the task that the considered LCS has to learn. We have found that a good way to implement modular shaping is first to train the basic LCSs, and then, after they have reached a good performance level, to "freeze" them (that is, basic LCSs are no longer learning, but only performing) and to start training upper-level LCSs. Training basic LCSs is usually done in a separate session (in which training of different basic LCSs can be run in parallel).
In principle, training different LCSs separately, that is, modular shaping , makes learning easier, although the shapingpolicy must be designed in a sensibleway. Hierarchical architecturesare particularly sensitiveto the shaping policy; indeed, it seemsreasonable that the coordination modules be shapedafter the lower modules have learned to produce the basicbehaviors. The experimentsin chapter 4 on two-level and three-level switch architectures show that good results are obtained, as we said above, by shaping the lower LCSs, then freezing them and shaping the coordinators, and finally letting all componentsbe free to go on learning together.
3.5 POINTS TO REMEMBER
. Complex behaviorscan be obtained from simpler behaviors through a number of composition mechanisms. . When using ALBCSYS to build the learning control systemof an autonomous agent, a choice must be made regarding the systemarchitecture. We have pre'sentedthree types of architectures: monolithic , distributed fiat , and distributed hierarchical. . The introduction of an LCS module into an architecturemust satisfy an " information " . compression criterion: the input string to the LCS must be longer than the output string sent to the upper-level LCS. . When designingan architecture, a few qualitative criteria regarding the matching betweenthe structure of the behavioral task to be learned and the type of architecture must be met. In particular, the component LCSs of the architecture should correspond to well-specifiedbehavioral tasks, and the chosen architecture should naturally match the structure of the global target behavior. . Distributed architecturescall for a shaping policy . We presentedtwo extremechoices: holistic and modular shaping.
Chapter 4
Experiments in Simulated Worlds
4.1 INTRODUCTION
In this chapter we present some results obtained with simulated robots. Because it takes far less time to run experiments in simulated worlds than in the real world, a much greater number of design possibilities can be tested. The following questions guided the choice of the experiments:
. Does ICS perform better than a more standard LCS like LCSo?
. Can a simple form of memory help the AutonoMouse to solve problems that require it to remember what happened in the recent past?
. Does decomposition in subtasks help the learning process?
. How must shaping be structured? Can basic behaviors and coordination of behaviors be learned at the same time, or is it better to split the learning process into several distinct phases?
. Is there any relation between the agent's architecture and the shaping policy to be used?
. Can an inappropriate architecture impede learning?
. Can the different architectural approaches we used in the first experiments be composed to build more complex hierarchical structures?
. Can a bad choice of sensor granularity with respect to the agent's task impede learning?
In the rest of this section, we explain our experimental methodology, illustrate the simulated environments we used to carry out our experiments, and describe the simulated AutonoMouse we used in our experiments. In section 4.2, we present experiments run in a simple environment in which our simulated AutonoMouse had to learn to chase a moving light source. The aim of these experiments was to test ICS and ALECSYS,
and to study a simple memory mechanism we call "sensor memory." In section 4.3, we give a rather complete presentation of a multibehavior task and discuss most of the issues that arise when dealing with distributed architectures. In particular, we focus on the interplay between the shaping policy and the architecture chosen to implement the learning system. In section 4.4, we slightly complicate the overall learning problem by adding a new task, feeding, and by augmenting the dimension of the sensory input, making sensor messages longer and adding multiple copies of the same class of objects in the environment. We report extensively on experiments comparing different architectural choices and shaping policies. Section 4.5 briefly reports on an experiment run in a slightly more complex environment, which will be repeated, using this time a real robot, in chapter 5.
4.1.1 Experimental Methodology
In all the experiments in simulated worlds, we use
P = (Number of correct responses) / (Total number of responses)
as performance measure. That is, performance is measured as the ratio of correct moves to total moves performed from the beginning of the simulation. Note that the notion of "correct" response is implicit in the RP: a response is correct if and only if it receives a positive reinforcement. Therefore, we call the above defined ratio the "cumulative performance measure induced by the RP."
The experiments reported in sections 4.2 and 4.3 were run as follows. Each experiment consisted of a learning phase and a test session. The test session was run after the learning phase; during the test session the learning algorithm was switched off (but actions were still chosen probabilistically, using strength). The index P, measured considering only the moves done in the test session, was used to compute the performance achieved by the system after learning. For basic behaviors, P was computed only for the moves in which the behaviors were active; we computed the global performance as the ratio of globally correct moves to total moves during the whole test session, where at every cycle (the interval between two sensor readings) a globally correct move was a move correct with respect to the current goal. Thus, for example, if after ten cycles the Chase behavior had been active for 7 cycles, proposing a correct move 4 times, and the Escape behavior had been active for 6 cycles, proposing a correct move 3 times, then the Chase behavior performance was 4/7, the
Escape behavior performance 3/6, and the global performance (4 + 3)/(7 + 6) = 7/13. To compare the performances of different architectural or shaping policies, we used the Mann-Whitney nonparametric test (see, for example, Siegel and Castellan 1988), which can be regarded as the nonparametric counterpart of Student's t-test for independent samples. The choice of nonparametric statistics was motivated by the failure of our data, even if measured on a numeric scale, to meet the requirements for parametric statistics (for example, normal distribution).
The experiments presented in section 4.4 were run somewhat differently. First, instead of presetting a fixed number of learning cycles, we ran an experiment until there was at least some evidence that the performance was unlikely to improve further; this evidence was collected automatically by a steady state monitor routine, checking whether in the last k cycles the performance had significantly changed. In experiments involving multiphase shaping strategies, a new phase was started when the steady state monitor routine signaled that learning had reached a steady state. Second, experiments were repeated several times (typically five), but typical results were graphed instead of average results. In fact, the use of the steady state monitor routine made it difficult to show averaged graphs because new phases started at different cycles in different experiments. Nevertheless, all the graphs obtained were very similar, which makes us confident that the typical results we present are a good approximation of the average behavior of our learning system. The reason we used this different experimental approach is that we wanted to have a system that learned until it could profit from learning, and then was able to switch automatically to the test session.
Finally, section 4.5 presents simple exploratory experiments run to provide a basis of comparison for those to be run with a real robot (see section 5.3). In these experiments we did not distinguish between a learning and a test session, but ran a single learning session of appropriate length (50,000 cycles in the first experiments and 5,000 in the second).
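The bookkeeping behind this performance measure can be sketched as follows; this is our own illustration of the definitions above (it reproduces the 4/7, 3/6, 7/13 example), not code taken from ALECSYS.

# Sketch of the performance bookkeeping of section 4.1.1.
# A move is "correct" if and only if the RP rewarded it.
from collections import defaultdict

correct = defaultdict(int)
total = defaultdict(int)

def record_move(behavior, rewarded):
    total[behavior] += 1
    if rewarded:
        correct[behavior] += 1

def performance(behavior):
    # Cumulative performance P induced by the RP for one behavior.
    return correct[behavior] / total[behavior] if total[behavior] else 0.0

def global_performance():
    # Ratio of globally correct moves to total moves over the test session.
    c, t = sum(correct.values()), sum(total.values())
    return c / t if t else 0.0

# Example from the text: Chase active 7 times (4 correct),
# Escape active 6 times (3 correct).
for _ in range(4): record_move("Chase", True)
for _ in range(3): record_move("Chase", False)
for _ in range(3): record_move("Escape", True)
for _ in range(3): record_move("Escape", False)

print(performance("Chase"))     # 4/7
print(performance("Escape"))    # 3/6
print(global_performance())     # 7/13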
4.1.2 Simulation Environments
From chapter 3 it is clear that to test all the proposed architectures and shaping policies, we need many different simulated worlds. As we need a basic task for each basic behavior, in designing the experimental environments we were guided by the necessity of building environments in which basic tasks, and their coordination, could be learned by the tested agent architecture. We used the following environments:
Figure 4.1
Simulated environment setup: the AutonoMouse and its visual sensors; (a) Chase environment; (b) Chase/Escape environment; (c) Chase/Feed/Escape environment; (d) Find Hidden environment: the AutonoMouse does not see the light when it is in the shaded area.
. Chase environment (single-behavior environment).
. Chase/Escape environment (two-behavior environment).
. Chase/Feed/Escape environment (three-behavior environment).
. Find Hidden environment (two-behavior environment).
In the Chase environment (see figure 4.1a, in which the AutonoMouse is shown with its two binary eyes, which each cover a half-space), the task is to learn to chase a moving object. This environment was studied primarily to test the learning classifier system capabilities and as a test bed for proposed improvements in the LCS model. In particular, the issue was whether the use of punishments is a good thing; we also tested the optimal length for the message list, a parameter that can greatly influence the performance of the learning system. The Chase environment was used to
fine-tune the performance of ICS. The environment dimensions, for this and all the following environments, were 640 x 480 pixels, an AutonoMouse's step was set to 3 pixels, and the maximum speed of the AutonoMouse was the same as the speed of the light source.
In the Chase/Escape environment there are two objects: a light and a lair. There is also a noisy predator, which appears now and then, and which can be heard, but not seen, by the AutonoMouse. When the predator appears, the AutonoMouse, otherwise predisposed to chase the moving light source, must run into its lair and wait there until the predator has gone. The light source is always present. The predator appears at random time intervals, remains in the environment for a random number of cycles, and then disappears. The predator never enters the environment: it is supposed to be far away and its only effect is to scare the AutonoMouse, which runs to hide in its lair. Figure 4.1b shows a snapshot of the environment.
In the Chase/Feed/Escape environment there are three objects: a light, a food source, and a predator. This environment is very similar to the previous one, but for the presence of food and for the different robot-predator interaction. As above, the AutonoMouse is predisposed to chase the moving light source. When its distance from the food source is less than a threshold, which we set to 45 pixels, the AutonoMouse feels hungry and thus focuses on feeding. When the chasing predator appears, the AutonoMouse's main goal is to run away from it (there is no lair in this environment). The maximum speed of the AutonoMouse is the same as the speed of the light source and of the chasing predator. The light source and the food are always present (but the food can be seen only when closer than the 45-pixel threshold). As above, the predator appears at random time intervals, remains in the environment for a random number of cycles, and then disappears. Figure 4.1c shows a snapshot of the environment.
In the Find Hidden environment the AutonoMouse's task is composed of two subtasks: Chase and Search. As in the Chase environment, the AutonoMouse has to chase a moving light source. The task is complicated by the presence of a wall. Whenever it is interposed between the light and the AutonoMouse (see figure 4.1d), the AutonoMouse cannot see the light any longer, and must Search for the light.
4.1.3 The Simulated AutonoMice
The simulated AutonoMice are the simulated-world counterparts of the real AutonoMice used in the experiments presented in the next chapter. Their
limited capabilities are designed to let them learn the simple behavior patterns discussed in section 4.1.2.
4.1.3.1 Sensory Capabilities
In all experimental environments except the Chase/Feed/Escape environment, the simulated AutonoMouse has two eyes. Because each eye covers a visual angle of 180 degrees and the two eyes have a 90-degree overlap in the AutonoMouse's forward move direction, the eyes partition the environment into four nonoverlapping regions (see figure 4.1a). AutonoMouse eyes can sense the position of the light source, of food (if it is closer than a given threshold), of the chasing predator, and of the lair (we say the AutonoMouse can "see" the light and the lair). We call these "the basic AutonoMouse sensor capabilities." In our work we chose a particular disposition of sensors we considered reasonable. Obviously, a robot endowed with different sensors or even a different disposition or granularity of existing sensors would behave differently (see, for example, Cliff and Bullock 1993).
In one of the experiments run in the Chase environment (the experiment on sensor memory), the two eyes of the AutonoMouse covered a visual cone of 60 degrees, overlapping for 30 degrees; as a result, the 90-degree visual space in front of the AutonoMouse was partitioned into three sectors of 30 degrees each. The reason to restrict the visual field of the AutonoMouse was to make more evident the role of memory in learning to solve the task.
In experiments run in the Chase/Escape environment, in addition to the basic sensor capabilities, the AutonoMouse can also detect the difference between a close light and a distant light (the same is true for the lair), and can sense the presence of the noisy predator (we say it can "hear" the predator), but cannot "see" it. The distinction between "close" and "far" was necessary because the AutonoMouse had to learn two different responses within the hiding behavior: "When the lair is far, approach it"; "When the lair is close [the AutonoMouse is in the lair], do not move." The distinction between "close" and "far" was not necessary for the chasing behavior, but was maintained for uniformity.
In experiments run in the Chase/Feed/Escape environment, the simulated AutonoMouse has four eyes (like AutonoMouse II; see figure 1.3). Because each eye covers a visual angle of 90 degrees without overlap, the four eyes partition the environment into four nonoverlapping regions. As in the case of basic sensor capabilities, AutonoMouse eyes can sense
the position of the light source, of the food (if it is closer than a given threshold), and of the chasing predator. The reason to have sensors with different characteristics from those in the previous environments is that we wanted to see whether there was any significant change in the AutonoMouse's performance when it was receiving longer than necessary sensory messages (it actually took twice the number of bits to code the same amount of sensory information using four eyes than it did using two eyes, as in the Chase case). In the Find Hidden environment, the AutonoMouse was enriched with a sonar and two side bumpers. Both the sonar and the bumpers were used as on/off sensors.
AutonoMice with two eyes require two bits to identify an object's (light, lair, or predator) relative position, while AutonoMice with four eyes require four bits. In the experiments where the AutonoMouse differentiates between a close and a far region, one extra bit of information is added. Moreover, one bit of information is used for the sonar and one for each of the bumpers. In the Chase environment, the AutonoMouse receives two bits of information from sensors, to identify the light position. In the Chase/Escape environment, the AutonoMouse receives seven bits of information from sensors: two bits to identify light position, two bits to identify lair position, one bit for light distance (close/far), one bit for lair distance (close/far), and one bit to signal the presence of the predator. In the Chase/Feed/Escape environment the AutonoMouse receives twelve bits of information from sensors: four bits to identify light position, four bits to identify food position, and four bits to identify predator position. In the Find Hidden environment the AutonoMouse receives five bits of information from sensors: two bits to identify light position, one bit from the sonar, and one bit from each of the two side bumpers. Table 4.1 summarizes the sensory capabilities of the AutonoMouse in the different environments. Please note that two bits of information were added to messages to distinguish between sensor messages, internal messages, and motor messages.
Table 4.1
Sensory input in different experimental environments

                                  Light    Lair/Food   Predator   Sonar   Bumpers   tags     Total
Chase environment                 2 bits   -           -          -       -         2 bits   2 bits
Chase/Escape environment          3 bits   3 bits      1 bit      -       -         2 bits   7 bits
Chase/Feed/Escape environment     4 bits   4 bits      4 bits     -       -         2 bits   12 bits
Find Hidden environment           2 bits   -           -          1 bit   2 bits    2 bits   5 bits

4.1.3.2 Motor Capabilities
The AutonoMouse has a right and a left motor, and it can give the following movement commands to each of them: "Stay still"; "Move one step backward"; "Move one step forward"; and "Move two steps forward". The maximum speed of the AutonoMouse, that is, two steps
per cycle, was set to be the same as the speed of the moving light or of the chasing predator (it was found that setting a higher speed, say, four steps per cycle, made the task much easier, while a lower speed made it impossible). At every cycle, the AutonoMouse decides whether to move. As there are four possible actions for each motor, a move action can be described by four bits. The messages ALECSYS sends to the AutonoMouse motors are therefore six bits long, the two additional bits being used to identify the messages as motor messages. The structure of a message sent to the motors is shown in figure 4.2.

Figure 4.2
Structure of message going to motors (tags, left motor, right motor). The two bits going to each motor have the following meaning: 00: one step backward; 01: stay still; 10: one step forward; 11: two steps forward.

In the case of the Find Hidden experiment, we used a different coding for motor messages: the output interface included two bits, coding the four following possible moves: "still", "straight ahead", "ahead with a left turn", and "ahead with a right turn".
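As an illustration of the message format of figure 4.2, the following Python sketch decodes a six-bit motor message; the field order inside the string (tags first, then left motor, then right motor) is our assumption, since the exact bit layout is not spelled out here.

# Decode a six-bit motor message: 2 tag bits + 2 bits per motor.
MOTOR_COMMANDS = {
    "00": "one step backward",
    "01": "stay still",
    "10": "one step forward",
    "11": "two steps forward",
}

def decode_motor_message(msg):
    assert len(msg) == 6 and set(msg) <= {"0", "1"}
    tags, left, right = msg[0:2], msg[2:4], msg[4:6]
    return tags, MOTOR_COMMANDS[left], MOTOR_COMMANDS[right]

print(decode_motor_message("101101"))
# ('10', 'two steps forward', 'stay still')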
4.2 EXPERIMENTS IN THE Chase ENVIRONMENT
We ran experiments in the Chase environment to answer the following questions:1
. Is it better to use only positive rewards, or are punishments also useful?
. Are internal messages useful?
. Does ICS perform better than LCSo?
In particular, we
. compared the results obtained using only positive rewards with those obtained using both positive rewards and punishments;
. compared the results obtained using message lists of different lengths;
. evaluated the usefulness of applying the GA when the bucket brigade has reached a steady state;
. evaluated the usefulness of the mutespec operator; and
. evaluated the usefulness of dynamically changing the number of classifiers used.
We also ran an experiment designed to provide the AutonoMouse with some simple dynamic behavior capabilities, using a general mechanism based on the idea of sensor memory.
Table 4.2
Comparison between rewards and punishments training policy and only rewards training policy

                              ML = 1    ML = 2    ML = 4    ML = 8
Rewards and punishments
  Average                     0.904     0.883     0.883     0.851
  Std. dev.                   0.021     0.023     0.026     0.026
Only rewards
  Average                     0.744     0.729     0.726     0.682
  Std. dev.                   0.022     0.028     0.027     0.025

Note: Results obtained in test session; 1,000 cycles run after 1,000 cycles of learning. Values reported are means of performance index P over 20 runs, and standard deviations.
4.2.1 The Role of Punishments and of Internal Messages
Our experimental results have shown (see table 4.2) that using rewards and punishments works better than using only rewards for any number of internal messages. Mann-Whitney tests showed that the differences observed between the first and the second row of table 4.2 are highly significant for every message list (ML) length tested (p < 0.01).
Results have also indicated that, for any given task, a one-message ML is the best. This is shown both by learning curves for the rewards and punishments policy (see figure 4.3) and by results obtained in the test session (see table 4.2, mean performance for different values of ML).
Figure 4.3
Learning to chase the light source (chasing behavior). Comparison of different ML lengths, using rewards and punishments training policy; averages over 20 runs.
F1pre4.3 Learningto chaselight source(chasingbehavior ). Comparisonof differentML over20 runs. , usingrewardsandpunishments lengths trainingpolicy; averages studied the statistical significance of the differences between adjacent means by applying the Mann-Whitney test. Sampleswere ordered by increasingML length, and the choice of comparing adjacent sampleswas done a priori (which makes the application of the Mann -Whitney test a correct statistical procedure). The differencesconsideredwere everywhere found to be highly significant (p < 0.01) by the Mann -Whitney test except between ML = 2 and ML = 4. This result is what we expected, given that the consideredbehavior is a stimulus-responseone, which does not require the useof internal messages . Although l:Iolland' s original LCS used internal messages , Wilson ( 1994) has recently proposed a minimal LCS, called " ZCS," which uses none, although to learn dynamic behavior , some researchers(Cliff and Ross 1994) have recently extended Wilson' s ZCS by adding explicit memory (see also our experimentsin chapter 6) . Indeed, it is easy to understand that internal messagescan slow down the convergenceof the classifiers' strength to good values. It is also interesting to note that, using rewards and punishmentsand ML = 1, the systemachievesa good performance level (about 0.8) after only about 210 cycles(seefigure 4.3; 100cyclesdone in about 60 seconds
Table 4.3
Comparison of LCSo and LCSo + energy
Note: NC = Number of classifiers introduced by application of GA. NGA = Number of cycles between two applications of GA. For LCSo + energy, we report the average number of cycles. For LCSo, we call the GA every 500 iterations (500 was experimentally found to be a good value). P = Performance index averaged over 10 runs.
4.2.2 Experimental Evaluation of ICS
4.2.2.1 Energy and Steady State
In table 4.3, we report some results obtained comparing LCSo with its derived ICS, which differs only in that the GA is called automatically when the bucket brigade algorithm has reached a steady state. The attainment of a steady state was measured by the energy variable, introduced in section 2.2.1. Similar results were obtained in a similar environment in which the AutonoMouse had slightly different sensors (Dorigo 1993).
Table 4.4
Comparison between LCSo without and with dynamical change of number of classifiers

                                                   LCSo     LCSo with dynamical change
                                                            of number of classifiers
Maximum performance P achieved                     0.85     0.95
Iterations required to achieve P = 0.67            3,112    1,797
Time required to run 10,000 iterations (sec)       7,500    3,900
Average time of iteration (sec)                    0.75     0.39

Note: 10,000 cycles per run; results averaged over 10 runs.
The experiment was repeated for different values of the number of new classifiers introduced in the population as a result of applying the GA, because this was found to be a relevant parameter. Results show that the use of energy to identify a steady state, and the subsequent call of the GA when the steady state has been reached, improves LCSo performance, as measured by the P index. It is also interesting to note that the number of cycles between two applications of the GA is almost always smaller for LCSo with the GA called at steady state, which means that a higher number of classifiers will be tested in the same number of cycles.
4.2.2.2 The Mutespec Operator
Experiments were run to evaluate the usefulness of both the mutespec operator and the dynamical change of the number of classifiers used. Adding mutespec to LCSo resulted in an improvement of both the performance level achieved and the speed with which that performance was obtained. On average, after 100 calls of the GA, the performance P with and without the mutespec operator was, respectively, 0.94 and 0.85 (the difference was found to be statistically significant). Also, using mutespec, a performance P = 0.99 was obtained after 500 applications of the GA, while after the same number of GA calls, LCSo without mutespec reached P = 0.92.
4.2.2.3 Dynamically Changing the Number of Classifiers Used
Experiments have shown (see table 4.4) that adding the dynamic change of the number of activatable classifiers to LCSo decreases the average time for a cycle by about 50 percent. Moreover, the proposed method causes
4.2.3 A First Step toward Dynamic Behavior
The experiments described thus far concern S-R behavior, that is, direct associations of stimuli and responses. Clearly, the production of more complex behavior patterns involves the ability to deal with dynamic behavior, that is, with input-output associations that exploit some kind of internal state. In a dynamic system, a major function of the internal state is memory. Indeed, the limit of S-R behavior is that it can relate a response only to the current state of the environment.
It must be noted that ALECSYS is not completely without memory; both the strengths of classifiers and the internal messages appended to the message list embody information about past events. However, it is easy to think of target behaviors that require a much more specific kind of memory. For example, if physical objects are still or move slowly with respect to the agent, their current position is strongly correlated with their previous position. Therefore, how an object was sensed in the past is relevant to the actions to be performed in the present, even if the object is not currently perceived. Suppose that at cycle t the agent senses a light in the leftmost area of its visual field, and that at cycle t + 1 the light is no longer sensed. This piece of information is useful to chase the light because at cycle t + 1 the light is likely to be out of the agent's visual field on its left.
To study whether chasing a light can be made easier by a memory of past perceptions, we have endowed our learning system with a sensor memory, that is, a kind of short-term memory of the state of the AutonoMouse's sensors. In order to avoid an ad hoc solution to our problem, we have adopted a sensor memory that functions uniformly for all sensors, regardless of the task. The idea was to provide the AutonoMouse with a representation of the previous state of each sensor, for a fixed period of time. That is, at any given time t the AutonoMouse can establish, for each sensor S, whether
(i) the state of S has not changed during the last k cycles (where the memory span k is a parameter to be set by the experimenter);
(ii) the state of S has changed during the last k cycles; in this case, enough information is given so that the previous state of S can be reconstructed.
This design allows us to define a sensor memory that depends on the input interface, but is independent of the target behavior (with the exception of k, whose optimal value can be a function of the task). More precisely, the sensor memory is made up of
. a memory word, isomorphic to the input interface;
. an algorithm that updates the memory word at each cycle, in accordance with specifications (i) and (ii), on the basis of the current input, the previous memory word, and the number of cycles elapsed from the last change;
. a mechanism that appends the memory word to the message list, with a specific tag identifying it as a memory message.
The memory process is described in more detail in algorithm box 4.1. Note that, coherently with our approach, the actual "meaning" of memory messages must be learned by the classifier systems. In other words, memory messages are just one more kind of messages, whose correlation with the overall task has to be discovered by the learning system.
The results obtained in a typical simulation are reported in figures 4.4-4.6, which compare the performances of the AutonoMouse with and without memory. The memory span was k = 10.
Figure 4.4 shows that the performance of the AutonoMouse with memory tends to become asymptotically better than the performance of the AutonoMouse without memory, although the learning process is slower. This is easy to explain: the "intellectual task" of the AutonoMouse with memory is harder because the role of the memory messages has to be learned; on the other hand, the AutonoMouse can learn about the role of memory only when it becomes relevant, that is, when the light disappears from sight,2 and this is a relatively rare event. To show that the role of memory is actually relevant, we have decomposed the AutonoMouse's performance into the performance produced when the light is not visible (and therefore memory is relevant; see figure 4.5), and the performance when the light is visible (and thus memory is superfluous; see figure 4.6).3 It can be seen that in the former case the performance of the AutonoMouse with memory is better. We conclude that even a very simple memory system can improve the performance of ALECSYS in cases where the target behavior is not intrinsically S-R.
Algorithm Box 4.1
The sensor memory algorithm
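The contents of algorithm box 4.1 are summarized by specifications (i) and (ii) above. The following Python sketch gives one plausible reading of that update rule; it is our reconstruction, and the particular encoding (each memory bit holds the pre-change value while the change is less than k cycles old, and the current value otherwise) is an assumption, not the authors' code.

# Reconstruction of a sensor memory satisfying (i) and (ii):
# memory differs from the current input exactly when the sensor
# changed within the last k cycles, and then it holds the previous state.
class SensorMemory:
    def __init__(self, n_sensors, k):
        self.k = k                      # memory span, set by the experimenter
        self.memory = [0] * n_sensors   # memory word, isomorphic to the input
        self.previous = [0] * n_sensors
        self.age = [k] * n_sensors      # cycles since the last change

    def update(self, inputs):
        for i, value in enumerate(inputs):
            self.age[i] = 0 if value != self.previous[i] else self.age[i] + 1
            # Recent change: remember the previous state; otherwise mirror input.
            self.memory[i] = self.previous[i] if self.age[i] < self.k else value
            self.previous[i] = value
        return self.memory              # appended to the message list with its own tag

mem = SensorMemory(n_sensors=2, k=10)
print(mem.update([1, 0]))   # light appears on the first sensor -> [0, 0]
print(mem.update([0, 0]))   # light gone: memory still signals where it was -> [1, 0]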
Figure 4.4
Chasing the moving light with and without sensor memory
4.3 EXPERIMENTS IN THE Chase/Escape ENVIRONMENT
The central issue investigated in this section is the role that the system architecture and the reinforcement program play in making a learning system an efficient learner.
In particular, we compare different architectural solutions and shaping policies. The logical structure of the learning system consists, in this case, of the two basic behaviors plus coordination of the basic behaviors. (Coordination can be made explicit, as in the switch architecture, or implicit, as in the monolithic architecture.) This logical structure, which can be mapped onto ALECSYS in a few different ways, is tested using both the monolithic architecture and the switch architecture. In the case of the switch architecture, we also compare holistic and modular shaping.
Another goal of this section is to help the reader understand how our experiments were run by giving a detailed specification of the objects in the simulation environment, the target behaviors, the learning architectures built using ALECSYS, the representation used, and, finally, the reinforcement program.
Figure 4.5
Light-chasing performance when the light is not visible
4.3.1 The Simulation Environment
Simulations were run placing the simulated AutonoMouse in a two-dimensional environment where there were some objects that it could perceive. The objects are as follows (a minimal simulation sketch is given after this list):
. A moving light source, which moves at a speed set equal to the maximum speed of the simulated AutonoMouse. The light moves along a straight line and bounces against walls (the initial position of the light is random).
. A predator, which appears periodically and can only be heard.
. The AutonoMouse's lair, which occupies the upper right angle of the environment (see figure 4.1b).
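The following Python sketch is our own illustration of these dynamics, using the environment dimensions and step size given in section 4.1.2; the predator appearance probability, its duration range, and the lair coordinates are assumptions made only for the example.

# Minimal sketch (not ALECSYS code) of the Chase/Escape world: a bouncing
# light, a fixed lair in the upper right corner, and a predator that is
# heard, not seen, at random intervals for a random number of cycles.
import random

WIDTH, HEIGHT = 640, 480          # environment size in pixels
STEP = 3                          # one AutonoMouse step, in pixels

class ChaseEscapeWorld:
    def __init__(self):
        self.light = [random.uniform(0, WIDTH), random.uniform(0, HEIGHT)]
        self.light_v = [STEP, STEP]            # light moves at the robot's maximum speed
        self.lair = (WIDTH - 40, HEIGHT - 40)  # "upper right angle" under our coordinate convention
        self.predator_heard = False
        self.predator_timer = 0

    def tick(self):
        # The light moves along a straight line and bounces against the walls.
        for i, limit in enumerate((WIDTH, HEIGHT)):
            self.light[i] += self.light_v[i]
            if not 0 <= self.light[i] <= limit:
                self.light_v[i] = -self.light_v[i]
                self.light[i] += 2 * self.light_v[i]
        # The predator appears at random intervals and stays a random number of cycles.
        if self.predator_heard:
            self.predator_timer -= 1
            self.predator_heard = self.predator_timer > 0
        elif random.random() < 0.01:
            self.predator_heard = True
            self.predator_timer = random.randint(50, 500)

world = ChaseEscapeWorld()
for _ in range(1000):
    world.tick()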
4.3.2 Target Behavior
The goal of the learning system is to have the simulated AutonoMouse learn the following three behavior patterns:
1. Chasing behavior. The simulated AutonoMouse likes to chase the moving light source.
2. Escaping behavior. The simulated AutonoMouse occasionally hears the sound of a predator. Its main goal then becomes to reach the lair as soon as possible and to stay there until the predator goes away.
3. Global behavior. A major problem for the learning system is not only to learn single behaviors, but also to learn to coordinate them. As we will see, this can be accomplished in different ways. If the learning process is successful, the resulting global behavior should be as follows: the simulated AutonoMouse chases the light source; when it happens to hear a predator, it suddenly gives up chasing, runs to the lair, and stays there until the predator goes away.
In these simulations the AutonoMouse can hear, but cannot see, the predator. Therefore, to avoid conflict situations like having the predator between the agent and the lair, we make the implicit assumption that the AutonoMouse can hear the predator when the predator is still very far away, so that it always has the time to reach the lair before the predator enters the computer monitor.

Figure 4.6
Light-chasing performance when the light is visible
4.3.3 The Learning Architectures: Monolithic and Hierarchical
Experiments were run with both a monolithic architecture (see figure 3.1) and a switch architecture (see figures 3.3 and 3.4). In the monolithic architecture the learning system is implemented as a single low-level parallel learning classifier system (on three transputers), called "LCS-global". In the switch architecture, the learning system is implemented as a set of three classifier systems organized in a hierarchy: two classifier systems learn the basic behaviors (Chase and Escape), while one learns to switch between the two basic behaviors. To implement the switch architecture, we took advantage of the high-level parallelism facility of ALECSYS; we used one transputer for the chasing behavior, one transputer for the escaping behavior, and one transputer to learn to switch.
4.3.4 Representation
As we have seen in chapter 2, ALECSYS classifiers have rules with two condition parts and one action part. Conditions are connected by an AND operator; a classifier enters the activation state if and only if it has both conditions matched by at least a message. Conditions are strings on {0,1,#}^k, and actions are strings on {0,1}^k. The value of k is set to be the same as the length of the longest among sensor and motor messages. In our experiments the length of motor messages is always six bits, while the length of sensor messages depends on the type of architecture chosen. When using the monolithic architecture, sensor messages are composed of nine bits, seven bits used for sensory information and two bits for the tag, which identifies the message as being a sensor message.4 (Therefore, a classifier will be 3 x 9 = 27 bits long.) In the case of the switch architecture, sensory information is split to create two messages, a five-bit message for the chasing behavioral module (two bits for light position, one bit for light distance, and two bits as a tag), and a six-bit message for the escaping behavioral module (two bits for lair position, one bit for lair distance, one bit for the predator presence message, and two bits as a tag).
When using a switch architecture, it is also necessary to define the format of the interface messages among basic LCSs and switches. At every cycle each basic LCS sends, besides the message directed to motors if that is the case, a one-bit message to the switch. This one-bit message is intended to signal the switch the intention to propose an action. A message composed by all the bits coming from basic LCSs (a two-bit message in our experiment: one bit from the Chase LCS and one bit from the Escape LCS) goes as input to the switch. Hence the switch selects one of the basic LCSs, which is then allowed to send its action to the AutonoMouse motors. The value and the meaning of the bit sent from the basic LCSs to the switch is not predefined, but is learned by the system.
In figure 4.7, we show the format of a message from sensory input in the monolithic architecture; in figure 4.8, the format of a message from sensory input to basic LCSs in the switch architecture.
Figure 4.7
Structure of the message from sensory input using the monolithic architecture (tags, light position, light distance, lair position, lair distance, presence of the predator)

Figure 4.8
Structure of the messages from sensory input to the basic LCSs using the switch architecture
Consider, for example, figure 4.8b, which shows the format of the sensor message going to the Escape LCS. The message has six bits. Bits one and two are tags that indicate whether the message comes from the environment or was generated by a classifier at the preceding time step (in which case they distinguish between a message that caused an action and one that did not). Bits three and four indicate lair position using the following encoding:
00: no lair,
01: lair on the right,
10: lair on the left,
11: lair ahead.
Bit five indicates lair distance; it is set to 1 if the lair is far away and to 0 if the lair is very close. Bit six is set to 1 if the AutonoMouse hears the sound produced by an approaching predator; to 0 otherwise.
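The encoding just listed can be made concrete with a small Python sketch that builds the six-bit Escape message and matches a ternary classifier condition against it; the tag value "00" used for environment messages is our assumption, since the tag encoding is not spelled out here.

LAIR_POSITION = {"none": "00", "right": "01", "left": "10", "ahead": "11"}

def escape_message(lair, lair_far, predator_heard, tags="00"):
    # Bits: 1-2 tags, 3-4 lair position, 5 lair distance, 6 predator heard.
    return (tags
            + LAIR_POSITION[lair]
            + ("1" if lair_far else "0")
            + ("1" if predator_heard else "0"))

def condition_matches(condition, message):
    # A condition is a string over {0,1,#}; '#' matches either bit value.
    return len(condition) == len(message) and all(
        c == "#" or c == m for c, m in zip(condition, message))

msg = escape_message(lair="ahead", lair_far=True, predator_heard=True)
print(msg)                                     # '001111'
print(condition_matches("00##11", msg))        # True: predator heard, lair far, any position

Recall from section 4.3.4 that an ALECSYS classifier has two such conditions, both of which must be matched by some message for the classifier to enter the activation state.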
4.3.5 The Reinforcement Program
As we have seen in section 1.2, the role of the trainer (i.e., the RP) is to observe the learning agent's actions and to provide reinforcements after each action. An interesting issue regarding the use of the trainer is how the RP should be structured to make the best use of the architecture. To answer this question, we ran experiments to compare the monolithic and the switch architecture. Our RP comprises two procedures called "RP-Chase" and "RP-Escape". RP-Chase gives the AutonoMouse a reward if it moves so that its distance from the light source does not increase. On the other hand, if the distance from the light source increases, it punishes ALECSYS. RP-Escape, which is used when the predator is present, rewards the AutonoMouse when it makes a move that diminishes the distance from the lair, if the lair is far; or when it stays still, if the lair is close. Although the same RP is used for both the monolithic architecture and the switch architecture, when using the switch architecture a decision must be made about how to shape the learning system.
4.3.6 Choice of an Architecture and a Shaping Policy
The number of computing cycles a learning classifier system requires to learn to solve a given task is a function of the length of its rules, which in turn depends on the complexity of the task. A straightforward way to reduce this complexity, which was used in the experiments reported below, is to split the task into many simpler learning tasks. Whenever this can be done, the learning system performance should improve. This is what is tested by the following experiments, where we compare the monolithic and the switch architectures, which were described in chapter 3. Also, the use of a distributed behavioral architecture calls for the choice of a shaping policy; we ran experiments with both holistic and modular shaping.
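Before turning to those experiments, the two procedures of section 4.3.5 can be sketched as follows; the numeric reinforcement values (+1/-1) and the Euclidean distance helper are our own choices, while the decision rules follow the text.

# Sketch of RP-Chase and RP-Escape (section 4.3.5).
import math

def rp_chase(before, after, light):
    # Reward any move that does not increase the distance from the light.
    return +1 if math.dist(after, light) <= math.dist(before, light) else -1

def rp_escape(before, after, lair, lair_is_far):
    if lair_is_far:
        # Reward moves that diminish the distance from the lair.
        return +1 if math.dist(after, lair) < math.dist(before, lair) else -1
    # Lair is close (the robot is in it): reward staying still.
    return +1 if after == before else -1

def reinforcement(predator_heard, **kw):
    # RP-Escape is used when the predator is present, RP-Chase otherwise.
    if predator_heard:
        return rp_escape(kw["before"], kw["after"], kw["lair"], kw["lair_is_far"])
    return rp_chase(kw["before"], kw["after"], kw["light"])

print(reinforcement(False, before=(0, 0), after=(3, 0), light=(10, 0)))  # +1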
Table 4.5
Comparison across architectures during test session

            Monolithic          Holistic-switch     Modular-switch-      Modular-switch-
            architecture        architecture        long architecture    short architecture
            Avg.     Std.dev.   Avg.     Std.dev.   Avg.     Std.dev.    Avg.     Std.dev.
Chasing     0.830    0.169      0.941    0.053      0.980    0.015       0.942    0.034
Escaping    0.744    0.228      0.919    0.085      0.978    0.022       0.920    0.077
Global      0.784    0.121      0.933    0.073      0.984    0.022       0.931    0.085

Note: Values reported are for performance index P averaged over 20 runs.
In experiment A , we compared the monolithic architecture and the " " switch architecturewith holistic shaping( holistic-switch ). Differencesin performancebetweenthesetwo architectures, if any, are due only to the different architectural organization. Each architecturewas run for 15,000 learning cycles, and was followed by a 1,000- cycle test session. In experim! nt B, we ran two subexperiments( Bl and B2) using the " " switch architecture with modular shaping ( modular-switch ). In subexperim B 1, we gavethe modular-switch architectureroughly the same amount of computing resourcesas in experimentA , so as to allow a fair comparison. We ran a total of 15,000 cycles: 5,000 cycles to train the chasing behavioral module, 5,000 cyclesto train the escapingbehavioral module, and 5,000 cyclesto train the behavior coordination module (for " easeof reference, we call the architecture of this experiment modular" switch-long ) . It is important to note that , although the number of learning ) cycleswas the sameas in experiment A , the actual time (in seconds required was shorter becausethe two basic behaviorscould be trained in parallel. Each learning phasewas followed by a 1,000-cycletest session. In subexperiment B2, we tried to find out what was the minimal amount of resourcesto give to the modular-switch architecture to let it reach the same performance level as the best performing of the two architecturesused in experimentA . We ran a total of 6,000 cycles: 2,000 cycles to train the chasing behavioral module, 2,000 cycles to train the escapingbehavioral module, and 2,000 cyclesto train the behavior coordination " " module; we refer to this architecture as modular-switch-short. - cycletest session. Again, each learning phasewas followed by a I ,OOO Results regarding the test sessionare reported in table 4.5. The bestperforming architecturewas the modular-switch. It performed best given
approximately the same amount of resources during learning, and it was able to achieve the same performance as the holistic-switch architecture using much less computing time. These results can be explained by the fact that in the switch architecture each single LCS, having shorter classifiers, has a smaller search space;5 therefore, the overall learning task is easier. The worst-performing architecture was the monolithic. These results are consistent with those obtained in a similar environment, presented in section 4.4. We studied the significance of the difference in mean performance between the monolithic and the holistic-switch architectures, and between the holistic-switch and the modular-switch-long architectures (the choice of which architectures to compare was made a priori). Both differences were found to be highly significant (p < 0.01) by the Mann-Whitney test for all the behavioral modules (Chase, Escape, and the global behavior).
4.4 EXPERIMENTS IN THE Chase/Feed/Escape ENVIRONMENT
In this section, we systematically compare different architecture and shaping policy combinations using the Chase/Feed/Escape environment.
We try to find confirmation of some of the answers we already had from the experiments presented in the previous section. Additionally, we make a first step toward studying the architecture scalability and the importance of choosing a good mapping between the structured behavior and the hierarchical architecture that implements it.
4.4.1 Monolithic Architecture
Because the monolithic architecture is the most straightforward way to apply LCSs to a learning problem, results obtained with the monolithic architecture were used as a reference to evaluate whether we would obtain improved performance by decomposing the overall task into simpler subtasks, by using a hierarchical architecture, or both. In an attempt to be impartial in comparing the different approaches, we adopted the same number of transputers in every experiment. And because, for a given number of processors, the system performance is dependent on the way the physical processors are interconnected, that is, on the hardware architecture, we chose our hardware architecture only after a careful experimental investigation (discussed in Piroddi and Rusconi 1992; and Camilli et al. 1990).
Figure 4.9 shows the typical result.
Figure 4.9
Cumulative performance of typical experiment using the monolithic architecture
An important observation is that the performance of the Escape behavior is higher than the performance of the Chase behavior, which in turn is higher than that of the Feed behavior. This result holds for all the experiments with all the architectures. The reasons are twofold. The Escape behavior is easier to learn because our learning agent can choose the escaping movement among 5 out of 8 possible directions, while the correct directions to Chase an object are, for our AutonoMouse, 3 out of 8 (see figure 4.10). The lower performance of the Feed behavior is explained by the fact that, in our experiments, the AutonoMouse could see the object to be chased and the predator from any distance, while the food could be seen only when closer than a given threshold. This caused a much lower frequency of activation of Feed, which resulted in a slower learning rate for that behavior.
Another observation is that, after an initial, very quick improvement of performance, both basic and global performances settled to an approximately constant value, far from optimality. After 80,000 cycles, the global performance reached the value 0.72 and did not change (we ran the experiment up to 300,000 cycles without observing any improvement). Indeed, given the length of the classifiers in the monolithic architecture, the search space (cardinality of the set of possible different classifiers) is
huge: the genetic algorithm, together with the apportionment of credit system, appears unable to search this space in a reasonable amount of time.

Figure 4.10
Difference between chasing and escaping behaviors: light-chasing directions and predator-escaping directions
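To give a rough sense of the size of the search space just mentioned, the following back-of-the-envelope estimate is ours, under the assumptions that sensor messages in this environment are 12 + 2 = 14 bits long (section 4.1.3.1) and that, as in section 4.3.4, a classifier consists of two ternary conditions and one binary action of that same length: the number of syntactically distinct classifiers is then 3^14 x 3^14 x 2^14 = 3^28 x 2^14, which is approximately 3.7 x 10^17.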
4.4.2 Monolithic Architecture with Distributed Input
With distributed input, environmental messages are shorter than in the previous case, and we therefore expect a better performance. More than one message can be appended to the message list at each cycle (maximum three messages, one for each basic behavior).
The results, shown in figure 4.11, confirmed our expectations: global performance settled to 0.86 after 80,000 cycles, and both the Chase and Escape behaviors reached higher performance levels than with the previous monolithic architecture. Only the Feed behavior did not improve its performance. This was partially due to the shortened run of the experiment. In fact, in longer experiments, in which it could be tested adequately, the Feed behavior reached a higher level of performance, comparable to that of the Chase behavior. It is also interesting to note that the graph qualitatively differs from that of figure 4.9; after the initial steep increase, performance continues slowly to improve, a sign that the learning algorithms are effectively searching the classifiers' space.
4.4.3 Two-level Switch Architecture
In this experiment we used a two-level hierarchical architecture, in which the coordination behavior implemented suppression.
Figure 4.11
Cumulative performance of typical experiment using the monolithic architecture with distributed input
The results, reported in figures 4.12 and 4.13, give the following interesting information. First, as shown in figure 4.12, where we report the performance of the three basic behaviors and of the coordinator (switch) in the first 50,000 cycles, and the global performance from cycle 50,000 to the end of the experiment, the use of the holistic shaping policy results in a final performance that is comparable to that obtained with the monolithic architecture. This is because, with holistic shaping, reinforcements obtained by each individual LCS are very noisy. With this shaping policy, we give each LCS the same reinforcement, computed from observing the global behavior, which means there are occasions when a double mistake results in a correct, and therefore rewarded, final action. Consider the situation where Escape is active and proposes a (wrong) move toward the predator, but the coordinator fails to choose the Escape module and chooses instead the Chase module, which in turn proposes a move away from the chased object (wrong move), say in the direction opposite to that of the predator. The result is a correct move (away from the predator) obtained by the composition of a wrong selection of the coordinator with a wrong proposed move of one of the two basic behaviors. It is easy to understand that it is difficult to learn good strategies with such a reinforcement function.
Figure 4.12
Cumulative performance of typical experiment using the two-level switch architecture. Holistic shaping.
Second, using the modular shaping policy, performance improves, as expected. The graph of figure 4.13 shows three different phases. During the first 33,000 cycles, the three basic behaviors were independently learned. Between cycles 33,000 and 48,000, they were frozen, that is, learning algorithms were deactivated, and only the coordinator was learning. After cycle 48,000, all the components were free to learn, and we observed the global performance. The maximum global performance value obtained with this architecture was 0.84.
Table 4.6 summarizes the results from using the monolithic and two-level architectures already presented in figures 4.9, 4.11, 4.12, and 4.13. The table shows that, from the point of view of global behavior, the best results were obtained by the monolithic architecture with distributed input and by the two-level switch architecture with modular shaping. The performance of the basic behaviors in the second was always better. In particular, the Feed behavior achieved a much higher performance level; using the two-level switch architecture with modular shaping, each basic behavior is fully tested independently of the others, and therefore the Feed behavior has enough time to learn its task. It is also interesting to note that the monolithic architecture and the two-level switch architecture with holistic shaping have roughly the same performance.
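The multiphase schedule just described (train the basic behaviors, freeze them while the coordinator learns, then let everything learn together), driven by the steady state monitor of section 4.1.1, can be sketched as follows; the plateau test, the synthetic learning curves, and all numbers are our own illustration, not the ALECSYS implementation.

import random

def reached_steady_state(history, k=200, epsilon=0.005):
    # "Steady state monitor": P did not change significantly in the last k cycles.
    if len(history) < 2 * k:
        return False
    return abs(sum(history[-k:]) - sum(history[-2 * k:-k])) / k < epsilon

def run_phase(name, perf_of_cycle):
    history = []
    while not reached_steady_state(history):
        history.append(perf_of_cycle(len(history)))
    print(f"{name}: steady state after {len(history)} cycles")

# Stand-in learning curves (real ones come from ALECSYS runs).
curve = lambda speed: (lambda t: min(0.95, 0.4 + speed * t) + random.gauss(0, 0.003))

# Phase 1: basic behaviors trained independently, in separate sessions.
run_phase("Chase", curve(0.0004))
run_phase("Escape", curve(0.0004))
run_phase("Feed", curve(0.0004))
# Phase 2: basic LCSs frozen, only the coordinator learns.
run_phase("Switch", curve(0.0006))
# Phase 3: all components free to go on learning together.
run_phase("Global", curve(0.0002))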
Figure 4.13
Cumulative performance of typical experiment using the two-level switch architecture. Modular shaping.
Figure 4.14
Cumulative performance of typical experiment using the three-level switch architecture of figure 3.4. Holistic shaping.
4.4.4 Three-level Switch Architecture
The three-level switch architecture stretches to the limit the hierarchical approach (a three-behavior architecture with more than three levels looks absurd). Within this architecture, the coordinator used in the previous architecture was split into two simpler binary coordinators (see figure 3.4). Using holistic shaping, results have shown that two- or three-level architectures are comparable (see figures 4.12 and 4.14, and table 4.7).
More interesting are the results obtained with modular shaping. Because we have three levels, we can organize modular shaping in two or three phases. With two-phase modular shaping, we follow basically the same procedure used with the two-level hierarchical architecture: in the second phase, basic behavioral modules are frozen and the two coordinators learn at the same time. In three-phase modular shaping, the second phase is devoted to shaping the second-level coordinator (all the other modules are frozen), while in the third phase, the third-level coordinator alone learns. Somewhat surprisingly, the results show that, given the same amount of resources (computation time in seconds), two-phase modular shaping gave slightly better results.
Table 4.7
Comparison of two- and three-level hierarchical architectures
Note: Values reported are for performance index P, with the number of iterations in parentheses.
With two-phase modular shaping, both coordination behaviors are learning for the whole learning interval, whereas with three-phase modular shaping, the learning interval is split into two parts during which only one of the two coordinators is learning, and therefore the two switches cannot adapt to each other. The graph in figure 4.15 shows the very high performance level obtained with two-phase modular shaping.
Table 4.7 compares the results obtained with the two- and three-level switch architectures: we report the performance of basic behaviors, of switches, and of the global behavior, as measured after k iterations, where k is the number in parentheses below each performance value. Remember that, as we already said, the number of iterations varied across experiments due to the steady state monitor routine, which automatically decided when to shift to a new phase and when to stop the experiment.
We have run another experiment using a three-level switch architecture to show that the choice of an architecture that does not correspond naturally to the structure of the target behavior leads to poor performance. The architecture we have adopted is reported in figure 4.16; it is similar to that of figure 3.4, except that the two basic behaviors Feed and Escape have been swapped. It can be expected that the new distribution of tasks between SW1 and SW2 will impede learning: because SW2 does not know whether SW1 is proposing a Chase or an Escape action, it cannot decide (and therefore learn) whether to suppress SW1 or the Feed behavioral module.
Figure 4.15
Cumulative performance of typical experiment using the three-level switch architecture of figure 3.4. Two-phase modular shaping.
Figure 4.16
Three-level switch architecture with "unnatural" disposition of coordination modules
Figure 4.17
Cumulative performance of typical experiment using the "unnatural" three-level switch architecture. Two-phase modular shaping.
Results are shown in figures 4.17 and 4.18. As in the preceding experiment, two-phase shaping gave better results than three-phase. It is clear from figure 4.18 that the low level of global performance achieved was due to the impossibility for SW2 to learn to coordinate the SW1 and the Feed modules.
4.4.5 The Issue of Scalability
The experiment presented in this subsection combines a monolithic architecture with multiple inputs and a two-level hierarchical architecture. We use a Chase/Feed/Escape environment in which four instances of each class of objects (lights, food pieces, and predators) are scattered around. Only one instance in each class is relevant for the learning agent (i.e., the AutonoMouse likes only one of the four light colors and one of the four kinds of food, and fears only one of the four potential predators). Therefore the basic behaviors, in addition to the basic behavioral pattern, have to learn to discriminate between different objects of the same class.
Figure 4.18
Cumulative performance of typical experiment using the "unnatural" three-level switch architecture. Three-phase modular shaping.
For example, the Escape behavior, instead of receiving a single message indicating the position of the predator (when present), now receives many messages regarding many different "animals", only one of which is a real predator. Different "animals" are distinguished by a tag, and Escape must learn to run away only from the real predator (it should be unresponsive to other animals). Our experiments show (see figure 4.19) that the AutonoMouse learns the new, more complex, task, although a greater number of cycles is necessary to reach a performance level comparable to that obtained in the previous experiment (figure 4.13).
4.5 EXPERIMENTS IN THE Find Hidden ENVIRONMENT
The aims of the Find Hidden experiment are twofold. First, we are interested to see whether our system is capable of learning a reasonably complex task, involving obstacle detection by sonar and bumpers and searching for a hidden light. The target behavior is to chase a light (Chase behavior), searching for it behind a wall when necessary (Search behavior).
Figure 4.19
Cumulative performance of typical experiment with multi-input, two-level switch architecture. Modular shaping.

Figure 4.20
Cumulative performance for Find Hidden task
The wall has a fixed position, while the light automatically hides itself behind the wall each time it is reached by the AutonoMouse. Second, we are interested in the relation between the sensor granularity, that is, the number of different states a sensor can discriminate, and the learning performance. In these experiments, we pay no attention to the issue of architecture, adopting a simple monolithic LCS throughout.
4.5.1 The Find Hidden Task
To shape the AutonoMouse in the Find Hidden environment, the reinforcement program was written with the following strategy in mind:

if a light is seen
then {Chase behavior} Approach it
else {Search behavior}
   if a distal obstacle is sensed by sonars
   then Approach it
   else if a proximal obstacle is sensed by bumpers
        then Move along it
        else Turn consistently {go on turning in the same direction, whichever it is}
        endif
   endif
endif

The distances of the AutonoMouse from the light and from the wall are computed from the geometric coordinates of the simulated objects.
The simulation was run for 50,000 cycles. In figure 4.20, we separately show the Chase behavior performance (when the light is visible), the Search behavior performance (when the light is not visible), and the Find Hidden performance. Chasing the light appears to be easier to learn than searching for it. This is easy to explain given that searching for the light is a rather complex task, involving moving toward the wall, moving along it, and turning around when no obstacle is sensed. It will be interesting to compare these results with those obtained with the real robot (see section 5.3).
4.5.2 Sensor Granularity and Learning Performance
This experiment explored the effect that different levels of sensor granularity have on the learning speed. To make things easier, we
concentrated on the sonar; very similar results were obtained using the light sensors. As a test problem, we chose to let the simulated AutonoMouse learn a modified version of the Search task. The reinforcement program used was as follows.

case distance of
  very_close : if backward then reward else punish;
  close      : if turn then reward else punish;
  far        : if slowly forward then reward else punish;
  very_far   : if quickly forward then reward else punish
endcase.
In our experiments, robot sensors were used by both the trainer and the learner. In fact, the trainer always used the sensor information with the maximum available granularity (in the sonar case, 256 levels) because in this way it could more accurately judge the learner's performance. On the other hand, the learner had to use the minimum amount of information necessary to make learning feasible, because a higher amount of information would make learning slower without improving the long-term performance. The correspondence between the distance variable and the sonar reading was set by the trainer as follows:

distance := very_close , if sensor_reading ∈ [ 0 ... 63 ],
distance := close ,      if sensor_reading ∈ [ 64 ... 127 ],
distance := far ,        if sensor_reading ∈ [ 128 ... 191 ],
distance := very_far ,   if sensor_reading ∈ [ 192 ... 255 ].
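Putting the two fragments together, the trainer's side of this experiment can be sketched as follows. This is a Python illustration with names of our own choosing, not the code actually used; it only shows how the 256-level sonar reading is bucketed into the four distance classes and how the proposed move is then rewarded or punished.

def distance_class(sensor_reading):
    # The trainer always uses the full 256-level sonar reading (0-255).
    if sensor_reading <= 63:
        return "very_close"
    if sensor_reading <= 127:
        return "close"
    if sensor_reading <= 191:
        return "far"
    return "very_far"

DESIRED_MOVE = {
    "very_close": "backward",
    "close": "turn",
    "far": "slowly_forward",
    "very_far": "quickly_forward",
}

def search_reinforcement(sensor_reading, proposed_move, reward=1, punish=-1):
    # Reward the move prescribed by the reinforcement program, punish any other.
    return reward if proposed_move == DESIRED_MOVE[distance_class(sensor_reading)] else punish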
We investigated the learning performance with the simulated robot for the following granularities of the sonar messages: 2, 4, 8, 16, 32, 64, 128, 256. This means that the learner was provided with messages from the sonar that ranged from a very coarse granularity (one-bit messages, that is, a binary division of the space) to a very high granularity (eight-bit messages, that is, 256 levels, the same as the trainer). Figure 4.21 shows the results obtained with the simulated robot. We also report the performance of a random controller for comparison. Results show that, as expected, when the sonar was used to discriminate among four levels, the performance was the best. In fact, this granularity level matched the four levels used by the trainer to give reinforcements.
Figure 4.21 Granularity of the sonar sensor and learning performance with the simulated AutonoMouse
It is interesting to note that, even with a too-low granularity (i.e., granularity = 2), the robot was able to perform better than in the random case. Increasing the granularity monotonically decreased the performance achieved by the learner. It is also interesting to note that the learner, in those cases where the granularity was higher than necessary (> 4), was able to create rules that (by the use of "don't care" symbols matching the lower positions of input messages) clustered together all inputs belonging to the same output class as defined by the trainer's policy.
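To make the clustering effect concrete, the small check below shows how a single classifier condition with "don't care" symbols ('#') on the six low-order bits of an eight-bit sonar message covers one whole trainer-level bucket. The patterns are illustrative, not taken from an actual run.

def matches(condition, message):
    # A classifier condition matches a binary message if every position is
    # either the same bit or the "don't care" symbol '#'.
    return all(c in ('#', m) for c, m in zip(condition, message))

# One condition per trainer-level bucket: the two high-order bits identify
# the bucket, the six low-order bits are ignored.
assert matches("00######", format(17, "08b"))    # reading 17  -> very_close (0-63)
assert matches("01######", format(100, "08b"))   # reading 100 -> close      (64-127)
assert matches("11######", format(200, "08b"))   # reading 200 -> very_far   (192-255)

Each such condition clusters together all readings belonging to the same output class as defined by the trainer's policy.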
4.6 POINTS TO REMEMBER
• ALECSYS allows the AutonoMouse to learn a set of classifiers that are tantamount to solving the learning problems posed by the simple tasks we have worked with in this chapter.
• Our results suggest that it is better to use rewards and punishments than only rewards.
• Using the mutespec operator, dynamically changing the number of classifiers, and calling the genetic algorithm when the bucket brigade has reached a steady state all improve the LCS's performance.
• The AutonoMouse can benefit from a simple type of memory called "sensor memory."
• The factorization of the learning task into several simpler learning tasks helps. This was shown by the results of our experiments.
• The system designer must make two architectural decisions: how to decompose the learning problem, and how to make the resulting modules interact. The first issue is a matter of efficiency: a basic behavior should not be too difficult to learn. In our system, this means that classifiers should be no longer than about 30 bits (and therefore messages cannot be longer than 10 bits). The second issue is a matter of efficiency, comprehensibility, and learnability.
• The best shaping policy, at least in the kind of environments investigated in this chapter, is modular shaping.
• The granularity of sensor messages is best when it matches the granularity chosen by the trainer to discriminate between different reinforcement actions.
Chapter 5
Experiments in the Real World

5.1 INTRODUCTION
Moving from simulated to real environments is challenging. New issues arise: sensor noise can appear, motors can be inaccurate, and in general there can be differences of any sort between the designed and the actual functioning of the robot. Moreover, not only do the robot's sensors and effectors become noisy, but the RP must also rely on real, and hence noisy, sensors to evaluate the learning robot's moves. It is therefore important to determine to what extent the real robot can use the ideas and the software used in simulations.
Although the above problems can be studied from many different points of view, running experiments with real robots has constrained us to limit the extent of our investigation. In section 5.2, we focus on the role that the trainer plays in making ALECSYS learn a good control policy: we study how different trainers can influence the robustness to noise and to lesions in the real AutonoMouse. We present results from experiments in the real world using AutonoMouse II. In section 5.3, we repeat, with the real robot (AutonoMouse IV), the Find Hidden experiment and the sensor granularity experiment already run in simulation.
Results obtained with the simulations presented in the preceding chapter suggest setting the message list length to 1, using 300 classifiers on a three-transputer configuration, and using both rewards and punishments. In particular, we experimentally found that learning tended to be faster when punishments were somewhat higher (in absolute value) than rewards; a typical choice would be using a reinforcement of +10 for rewards and of -15 for punishments. In all the experiments presented in this chapter, we used the monolithic architecture.
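For concreteness, the asymmetric reinforcement mentioned above amounts to a one-line policy; the values are the typical ones quoted in the text, and the function name is ours.

def reinforcement(move_was_correct, reward=10, punish=-15):
    # Punishments are somewhat larger in absolute value than rewards.
    return reward if move_was_correct else punish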
Figure 5.1 Functional schematic of AutonoMouse II (left and right frontal eyes with their visual cones, rear left and right eyes, frontal and rear central eyes, left and right wheels and motors, central wheel)
5.2 EXPERIMENTS WITH AUTONOMOUSE II

5.2.1 AutonoMouse II Hardware
Our smallest robot is AutonoMouse II (see the picture in figure 1.3 and the sketch in figure 5.1). It has four directional eyes, a microphone, and two motors. Each directional eye can sense a light source within a cone of about 60 degrees. Each motor can stay still or move the connected wheel one or two steps forward, or one step backward. AutonoMouse II is connected to a transputer board on a PC via a 9,600-baud RS-232 link. Only a small amount of processing is done onboard (the collection of data from sensors, the sending of data to effectors, and the management of communications with the PC). All the learning algorithms are run on the transputer board.
The directional eyes sense a light source when it is in either or both eyes' field of view. In order to change this field of view, AutonoMouse II must rotate, because the eyes cannot do so independently; they are fixed with respect to the robot body. The sensor value of each eye is either on or off, depending on the relation of the light intensity to a threshold. The robot is also provided with two central eyes that can sense light within a half-space; these are used solely by the trainer while implementing the reward-the-result policy. These central eyes evaluate the absolute increase or decrease in light intensity (they distinguish 256 levels). The format of input messages is given in figure 5.2.
Figure 5.2 Meaning of messages from sensors in AutonoMouse II: (a) format of sensor messages in the first and second experiments; AutonoMouse II uses only the two frontal eyes; (b) format of sensor messages in the third experiment; AutonoMouse II uses the two frontal eyes and the microphone; (c) format of sensor messages in the fourth experiment; AutonoMouse II uses all four eyes and the microphone

We used three message formats because in different experiments AutonoMouse II needed different sensory capabilities. In the experiments on noisy sensors and motors, in the lesion studies, and in the experiment on learning to chase a moving light source (sections 5.2.4.1, 5.2.4.2, and 5.2.4.3, respectively), the robot can see the light only with the two frontal eyes; this explains the short input message shown in figure 5.2a. In the first of the two experiments presented in section 5.2.4.4, AutonoMouse II was also able to hear a particular environmental noise, which accounts for the extra bit in figure 5.2b. Finally, in the second experiment presented in section 5.2.4.4, using all the sensory capabilities resulted in the longest of the input messages (figure 5.2c).
AutonoMouse II moves by activating the two motors, one for the left wheel and one for the right. The available commands for the motors are "Move one step backward"; "Stay still"; "Move one step forward"; "Move two steps forward." From combinations of these basic commands arise forms of movement like "Advance"; "Retreat"; "Rotate"; "Rotate while advancing," and so on (there are 16 different composite movements). The format of messages to effectors is the same as for the simulated AutonoMouse (see figure 4.2).
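The effector message format can be made explicit with a small decoding sketch. The two-bits-per-wheel structure is as described in the text, but the particular bit patterns assigned to each command below are an assumption made for illustration (the text only lists the four commands), and the function name is ours.

WHEEL_COMMAND = {
    "00": "stay still",
    "01": "move one step forward",
    "10": "move two steps forward",
    "11": "move one step backward",
}

def decode_effector_message(message):
    # Four-bit effector message: two bits for the left wheel, two for the right.
    left, right = message[:2], message[2:]
    return WHEEL_COMMAND[left], WHEEL_COMMAND[right]

# 4 x 4 = 16 composite movements; e.g. one wheel forward and the other
# backward yields a "Rotate" movement.
print(decode_effector_message("0111"))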
5.2.2 Experimental Methodology
In the real world, experiments were run until either the goal was achieved or the experimenter was convinced that the robot was not going to achieve the goal (at least in a reasonable time period). Experiments with the real robots were repeated only occasionally because they took up a great deal of time. When experiments were repeated, however, differences between different runs were shown to be marginal.
The performance index used was the light intensity perceived by the frontal central eye (which increased with proximity to the light, up to a maximum level of 256). To study the relation between the reinforcement program and the actual performance, we plotted the average reward over intervals of 20 cycles.

5.2.3 The Training Policies
As we have seen, given that the environment used in the simulations was designed to be as close as possible to the real environment, messages going from the robot's sensors to ALECSYS and from ALECSYS to the robot's motors have a structure very similar to that of the simulated AutonoMouse. On the other hand, the real robot differs from the simulated one in that sensor input and motor output are liable to be inaccurate. This is a major point in machine learning research: simulated environments can only approximate real ones. To study the effect that using real sensors and motors has on the learning process, and its relation to the training procedure, we introduced two different training policies, called the "reward-the-intention" policy and the "reward-the-result" policy (see figure 5.3).
5.2.3.1 The Reward-the-Intention Policy
We say that a trainer "rewards the intention" if, to decide what reinforcement to give, it uses observations from an ideal environment positioned between ALECSYS and the real AutonoMouse. This environment is said to be "ideal" because there is no interference from the real world (it is the same type of environment ALECSYS senses in simulations). The reward-the-intention trainer rewards ALECSYS if it proposes a move that is correct with regard to the input-output mapping the trainer wants to teach (i.e., the reinforcement is computed observing the responses in the ideal, simulated world). A reward-the-intention trainer knows the desired input-output mapping and rewards the learning system if it learns that mapping, regardless of the resulting actions in the real world.
Figure 5.3 The two training policies (the "brain" and the "body")
5.2.3.2 The Reward-the-Result Policy
On the other hand, we say that a trainer "rewards the result" if, to decide what reinforcement to give, it uses observations from the real world. The reward-the-result trainer rewards ALECSYS if it proposes an action that diminishes the distance from the goal. (In the light-approaching example the rewarded action is a move that causes AutonoMouse II to approach the light source.) A reward-the-result trainer has knowledge about the goal, and rewards ALECSYS if the actual move is consistent with the achievement of that goal.
With the reward-the-intention policy, the observed behavior can be the desired one only if there is a correct mapping between the trainer's interpretation of sensor (motor) messages and the real-world significance of the AutonoMouse's sensory input (motor commands).¹ For example, if the AutonoMouse's two eyes are inverted, the reward-the-intention trainer will reward ALECSYS for actions that an external observer would judge as wrong: while the trainer intends to reward the "Turn toward the light" behavior, the learned behavior will be a "Turn away from the light" one. In the case of a reward-the-result trainer, the mapping problem disappears; ALECSYS learns any mapping between sensor messages and motor messages that maximizes the reward received from the trainer.
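The operational difference between the two policies can be summarized in a short sketch; it is only an illustration of the definitions above, with hypothetical names, not the trainer code. The reward-the-intention trainer compares the proposed action with the input-output mapping it wants to teach, whereas the reward-the-result trainer looks only at the observed effect in the real world.

def reward_the_intention(sensor_message, proposed_action, desired_mapping,
                         reward=1, punish=-1):
    # Judge the proposal against the mapping to be taught, regardless of the
    # action's actual outcome in the real world.
    return reward if desired_mapping[sensor_message] == proposed_action else punish

def reward_the_result(light_intensity_before, light_intensity_after,
                      reward=1, punish=-1):
    # Judge only the observed effect: did the move bring the robot closer
    # to the light source?
    return reward if light_intensity_after > light_intensity_before else punish

With inverted eyes, the first function keeps rewarding behavior that is wrong in the real world, while the second is unaffected, which is what the lesion experiments below show.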
In the following section, we present the results of some experiments designed to test ALECSYS's learning capability under the two different reinforcement policies. We also introduce the following handicaps to test ALECSYS's adaptive capabilities: inverted eyes, blindness in one eye, and incorrect calibration of motors. All graphs presented are relative to a single typical experiment.
5.2.4 An Experimental Study of Training Policies
In this section, we investigate what effect the choice of a training policy has on noisy sensors and motors, and on a set of lesions² we deliberately inflicted on AutonoMouse II to study the robustness of learning. The experiment was set in a room in which a lamp had been arbitrarily positioned. The robot was allowed to wander freely, sensing the environment with only its two frontal directional eyes. ALECSYS received a high reward whenever, in the case of the reward-the-result policy, the light intensity perceived by the frontal central sensor increased, or, in the case of the reward-the-intention policy, the proposed move was the correct one with respect to the received sensory input (according to the mapping on which the reward-the-intention trainer based its evaluations). Sometimes, especially in the early phases of the learning process, AutonoMouse II lost contact with the light source (i.e., it ceased to see the lamp). In these cases, ALECSYS received a reward if the robot turned twice in the same direction. This favored sequences of turns in the same direction, assuring that the robot would, sooner or later, see the light again. In the following sections, we report and comment on the results of experiments with the standard AutonoMouse II and with versions using noisy or faulty sensors or motors. In all the experiments, ALECSYS was initialized with a set of randomly generated classifiers.
5.2.4.1 Noisy Sensors and Motors
Figure 5.4 compares the performance obtained with the two different reinforcement policies using well-calibrated sensory and motor devices.³ Note that the performance is better for the reward-the-result policy. Drops in performance at cycle 230 for the reward-the-intention policy and at cycle 250 for the reward-the-result policy are due to sudden changes in the light position caused by the experimenter moving the lamp far away from the robot. Figure 5.5 shows the rewards received by ALECSYS for the same two runs. Strangely enough, the average reward is higher (and optimal after 110 cycles) for the reward-the-intention policy.
Figure 5.4 Difference in performance between reward-the-result and reward-the-intention training policies

Figure 5.5 Difference in average reward between reward-the-result and reward-the-intention training policies

This indicates that, given perfect information, ALECSYS learns to do the right thing. In the case of the reward-the-result policy, however, ALECSYS does not always get the expected reward. This is because, despite our efforts, we did not completely succeed in building good, noise-free sensors and motors and in calibrating them appropriately. Figure 5.6 directly compares light intensity and average reward for the reward-the-result policy (the average reward is multiplied by 25 to ease visual comparison).

Figure 5.6 System performance (light intensity) and average reward (multiplied by 25 to ease visual comparison) for the reward-the-result training policy

These results confirm our expectation, as discussed in section 2.3, that a trainer using the reward-the-intention policy, in order to be a good trainer, needs very accurate low-level knowledge of the input-output mapping it is teaching and of how to compute this mapping, in order to be able to give the correct reinforcements. Often this is an unreasonable requirement, because this accurate knowledge could be used directly to design the behavioral modules, making learning useless. On the other hand, a reward-the-result trainer only needs to be able to evaluate the moves of the learning system with respect to the behavioral pattern it is teaching.
Figure 5.7 Difference in performance between reward-the-result and reward-the-intention training policies for AutonoMouse II with inverted eyes
It is interesting that the number of cycles required by the real robot to reach the light is significantly lower than the number of cycles required by the simulated robot to reach a high performance in the previous experiments. This is easily explained if you consider that the correct behavior is more frequent than the wrong one as soon as performance is higher than 50 percent. The real robot therefore starts to approach the light source well before it has reached a high frequency of correct moves.

5.2.4.2 Lesions Study
Lesions differ from noise in that they cause a sensor or motor to systematically deviate from its design specifications. In this section we study the robustness of our system when altered by three kinds of lesions: inverted eyes, one blind eye, and incorrectly regulated motors. Experiments have shown that for each type of lesion the reward-the-result training policy was able to teach the target behavior, while the reward-the-intention training policy was not.
Results of the experiment in which we inverted the two frontal eyes of AutonoMouse II are shown in figures 5.7 and 5.8 (these graphs are analogous to those regarding the standard eye configuration of figures 5.4 and 5.5). While the graphs obtained for the reward-the-result training policy are qualitatively comparable, the graphs obtained for the reward-the-intention training policy differ greatly (see figures 5.4 and 5.7).
Figure 5.8 Difference in average reward between reward-the-result and reward-the-intention training policies for AutonoMouse II with inverted eyes
As discussed before (see section 2.3), with inverted eyes the reward-the-intention policy fails. (Note that in this experiment the light source was moved after 150 cycles.) Similar qualitative results were obtained using only one of the frontal directional eyes (the one-blind-eye experiment). Figure 5.9 shows that the reward-the-intention policy is not capable of letting the robot learn to chase the light source. In this experiment the light was never moved, because the task was already hard enough. The reward-the-result policy, however, allows the robot to approach the light, although it requires more cycles than with two working eyes. (At cycle 135, there is a drop in performance where AutonoMouse II lost sight of the light and turned right until it saw the light source again.)
As a final experiment, the robot was given badly regulated motors. In this experiment, the bits going to each motor had the following new meaning:
00: stay still,
01: one step forward,
10: two steps forward,
11: four steps forward.
Figure 5.9 Difference in performance between reward-the-result and reward-the-intention training policies for AutonoMouse II with one blind eye
The net effect was an AutonoMouse II that could not move backward any more, and that on average moved much more quickly than before. The result was that it was much easier to lose contact with the light source, as happened with the reward-the-intention policy after the light was moved (in this experiment the light source was moved after 90 cycles). Because the average speed of the robot was higher, however, it took fewer cycles to reach the light. Figure 5.10, the result of a typical experiment, shows that the reward-the-result training policy was better also in this case than the reward-the-intention training policy.
Figure 5.10 Difference in performance between reward-the-result and reward-the-intention training policies for AutonoMouse II with incorrect regulation of motors

5.2.4.3 Learning to Chase a Moving Light Source
Learning to chase a moving light source introduces a major problem: the reinforcement program cannot use distance (or light intensity) changes to decide whether to give a reward or a punishment. Indeed, there are situations in which AutonoMouse II moves toward the light source while the light source moves in the same direction. If the light source's speed is higher than the robot's, then the distance increases even though the robot made the right move. This is not a problem if we use a reinforcement program based on the intention, but it is if the reinforcement program is based on the result. Nevertheless, it is clear that the reward-the-intention policy is less appealing, because it is not adaptive with respect to the calibration of sensors and effectors.
A possible solution is to let ALECSYS learn to approach a stationary light, then to freeze the rule set (i.e., stop the learning algorithm) and use it to control the robot. With this procedure one should be aware that, once learning has been stopped, the robot can use only rules developed during the learning phase, which must therefore be as complete as possible. To ensure complete learning, we need to give AutonoMouse II the opportunity to learn the right behavior in every possible environmental situation (in our case, every possible relative position of the light source). Figure 5.11 reports the result of the following experiment. At first the robot is left free to move and to learn the approaching behavior. After 600 cycles we stop the learning algorithm, start to move the light source, and let the robot chase it. Unfortunately, during the first learning phase AutonoMouse II does not learn all the relevant rules (in particular, it does not learn to turn right when the light is seen by the right eye), and the resulting performance is not satisfactory. At cycle 1,300, we therefore start the learning algorithm again and, during the cycles between 1,700 and 2,200, we start to present the lamp to AutonoMouse II on one side, and wait until the robot starts to do the right thing.
Figure 5.11 AutonoMouse II performance in learning to chase a moving light
This procedure is repeated many times, presenting the light alternately in front, on the right, on the left, and in back. The exact procedure is difficult to describe, but very easy and natural to implement. What happens during the experiment, which we find very exciting, is that the reactive system learns in real time. The experimenter has only to repeat the procedure⁴ until visual feedback tells him that ALECSYS has learned, as manifested by the robot beginning to chase the light. After this training phase, learning is stopped again (at cycle 2,200) and the light source is steadily moved. Figure 5.11 shows that this time the observed behavior is much better (performance is far from maximum, however, because the light source is moving and the robot therefore never reaches it).

5.2.4.4 Learning to Switch between Two Behaviors
In these two experiments, the learning task was made more complicated by requiring AutonoMouse II to learn to change to an alternative behavior in the presence of a particular noise. Experiments were run using the reward-the-result policy.
In the first experiment, we used the two directional frontal eyes and the two behaviors Chase and Escape. During the Chase phase the experimental setting was the same as in the first experiment with standard capabilities (that is, the experiment in section 5.2.4.1), except for the input messages, which were one bit longer due to the noise bit (see figure 5.2b).
Figure 5.12 AutonoMouse II performance in learning to switch between two behaviors: chasing the light and escaping the light (the robot uses two eyes)
When the noise started, the reinforcement program changed according to the behavior that was to be taught, that is, escaping from the light. Again, as can be seen from figure 5.12, the observed behavior was exactly the desired one.
In the second experiment, we used the four directional eyes and the two behaviors Chase moving forward and Chase moving backward. During the Chase moving forward phase, the experimental setting was the same as in the first experiment with standard capabilities (that is, the experiment in section 5.2.4.1), except for the input messages, which were three bits longer due to the noise bit and to the two extra bits for the two rear eyes (see figure 5.2c). When the noise started, the reinforcement program changed according to the behavior that was to be taught. From figure 5.13, it is apparent that after the noise event (at cycle 152) performance fell very quickly;⁵ the robot had to make a 180-degree turn before it could again see the maximum intensity of light from the light source.
Figure 5.13 (performance plot showing the "moving forward" and "moving backward" phases)
As, after the turn, ALECSYS did not have the right rules (remember that the rules developed in the moving forward phase were different from the rules necessary during the moving backward phase because of the bit that indicated the presence or absence of noise), AutonoMouse II was in a state very similar to the one it was in at the beginning of the experiment. Therefore, for a while it moved randomly and its distance from the light source could increase (as happened in the reported experiment). At the end, it again learned the correct set of rules and, as can be seen in the last part of the graph, started to approach the light source moving backward.
These two experiments have shown that increasing the complexity of the task did not cause the performance of the system to drop to unacceptable levels. Still, the increase in complexity was very small, and further investigation will therefore be necessary to understand the behavior of the real AutonoMouse in environments like those used in simulations.
5.3 EXPERIMENTS WITH AUTONOMOUSE IV

5.3.1 AutonoMouse IV Hardware
AutonoMouse IV (see figure 5.14) has two directional eyes, a sonar, front and side bumpers, and two motors. Each directional eye can sense a light source within a cone of about 180 degrees. The two eyes together cover a 270-degree zone, with an overlap of 90 degrees in front of the robot.
Figure 5.14 Functional schematic of AutonoMouse IV
The sonar is highly directional and can sense an object as far away as 10 meters. The output of the sonar can assume two values: either "I sense an object" or "I do not sense an object." Each motor can stay still or move the connected track one or two steps forward, or one step backward. AutonoMouse IV is connected to a transputer board on a PC via a 4,800-baud infrared link.

5.3.2 Experimental Settings
The aim of the experiments presented in this section was to compare the results obtained in simulation with those obtained with the real robot. The task chosen was Find Hidden, already discussed in section 4.5. The environment in which we ran the real robot experiments consisted of a large room containing an opaque wall, about 50 cm x 50 cm, and an ordinary lamp (50 W). The wall's surface was pleated, in order to reflect the sonar's beam from a wide range of directions. The input and output interfaces were exactly the same as in the simulation, and so was the RP. The input from the sonar was defined in such a way that a front obstacle could be detected within about 1.5 m from the robot. The first experiment presents the results obtained on the complete task, while the second experiment discusses results on the relation between sensor granularity and learning speed.
Although we designed the simulated environment and robot so as to make them very similar to their real counterparts, there were three main differences between the real and the simulated experiments. First, in the real environment the light was moved by hand and hidden behind the wall when AutonoMouse IV got very close to it (10-15 cm), as compared to the simulated environment, where the light was moved by the
simulation program in a systematic way. The manual procedure introduced an element of irregularity, although, from the results of the experiments, it is not clear whether this irregularity affected the learning process.
Second, the distances of AutonoMouse IV from the light and from the wall were estimated on the basis of the outputs of the light sensors and of the sonar, respectively. More precisely, each of the two eyes and the sonar output an eight-bit number from 0 to 255, which coded, respectively, the light intensity and the distance from an obstacle. To estimate whether the robot got closer to the light, the total light intensity (that is, the sum of the outputs of both eyes) at cycle t was compared with the total light intensity at cycle t - 1. The sonar's output was used in a similar way to estimate whether the robot got closer to the wall. The eyes and the sonar were used in different ways by the agent and by the RP: from the point of view of the agent, all these sensors behaved as on/off devices; for the RP, the eyes and the sonar produced an output with higher discriminative power. Therefore, the same hardware devices were used as the trainer's sensors and, through a transformation of their outputs, as the sensors of the agent (the main reason for providing the agent with a simplified binary input was to reduce the size of the learning system's search space, thus speeding up learning). By exploiting the full binary output of the eyes and of the sonar, it is possible to estimate the actual effect of a movement toward the light or the wall. However, given the sensory apparatus of AutonoMouse IV, we cannot establish the real effect of a left or right turn; for these cases, the RP based its reward on the expected move, that is, on the output message sent to the effectors, and not on the actual move. In other words, we used a reward-the-intention reinforcement policy. To overcome this limitation, we have added to AutonoMouse IV a tail which can sense rotations of the robot. The resulting robot, which we call "AutonoMouse V," will be the object of experimentation in chapter 7.
Third and finally, for practical considerations, the experiments with AutonoMouse IV were run for about four hours, covering only 5,000 cycles, while the simulated experiments were run for 50,000 cycles.
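The estimation procedure just described can be sketched as follows; the names are ours and the comparison is deliberately simplified. We assume here that a smaller sonar reading means a nearer obstacle, as suggested by the description above.

def got_closer_to_light(left_eye, right_eye, prev_left_eye, prev_right_eye):
    # Each eye outputs an eight-bit intensity (0-255); the RP compares the
    # total intensity at cycle t with the total at cycle t-1.
    return (left_eye + right_eye) > (prev_left_eye + prev_right_eye)

def got_closer_to_wall(sonar, prev_sonar):
    # The sonar output codes the distance from an obstacle; a smaller
    # reading is taken here to mean a nearer obstacle (assumption).
    return sonar < prev_sonar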
5.3.3 Find Hidden Task: Experimental Results
The graph of figure 5.15 shows that AutonoMouse IV learned the target behavior reasonably well, as was intuitively clear by direct observation during the experiment.
Figure 5.15 Find Hidden experiment with AutonoMouse IV

There is, however, a principal discrepancy between the results of the real and those of the simulated experiments (see figure 4.20). In the simulation, after 5,000 cycles the Chase, Search, and Find Hidden performances had reached, respectively, the approximate values of 0.92, 0.76, and 0.81; the three corresponding values in the real experiment are lower (about 0.75) and very close to each other. To put it differently, although the real and the simulated light-searching performances are very similar, in the simulated experiment the chasing behavior is much more effective than the searching behavior, while in the real experiment they are about the same.
We interpret this discrepancy between the real and simulated experiments as an effect of the different ways in which the distance between the robot and the light was estimated. In fact, the total light intensity did not allow for a very accurate discrimination of such a distance. A move toward the light often did not result in an increase of total light intensity large enough to be detected; therefore, a correct move was not rewarded, because the RP did not understand that the robot had gotten closer to the light. As a consequence, the rewards given by the RP with respect to the light-chasing behavior were not as consistent as in the simulated experiments.
Figure 5.16 Sensor virtualization with AutonoMouse IV
Table 5.1
Comparison of real and simulated AutonoMouse IV performances when changing the sensor's granularity

Sonar granularity    Performance of simulated robot after 5,000 cycles    Performance of real robot after 5,000 cycles
2                    0.456                                                0.378
4                    0.926                                                0.906
64                   0.770                                                0.496
5.3.4 Sensor Granularity: Experimental Results
To further explore the relation between sonar sensor granularity and learning performance, we repeated with the real robot the same experiment previously run in simulation (see section 4.5.2). We set the granularity of the real robot's sonar sensor to 2, 4, and 64 (see figure 5.16). As was the case in the simulation experiment, the best performance was obtained with granularity 4, that is, when the granularity matched the number of levels used by the trainer to provide reinforcements. It is interesting to note (see table 5.1) that the overall performance level was lower in the real robot case. This is due to the difference between the simulated and the real sonar: the real sonar was able to detect a distant
obstacle only if the sonar beam hit the obstacle at an angle very close to 90 degrees.
5.4 POINTS TO REMEMBER
• Results obtained in simulation carry over smoothly to our real robots, which suggests that we can use simulation to develop the control systems and fine-tune them on real robots.
• The kind of data used by the trainer can greatly influence the learning capabilities of the learning system. While this can seem a trivial point, we have shown how two reasonable ways to use data on a robot's behavior determine a different robustness of the learning system to the robot's malfunctioning.
• As was the case in the simulation experiments, sensor granularity is best when it matches the number of levels used by the trainer to provide reinforcements.
Chapter 6
Beyond Reactive Behavior

6.1 INTRODUCTION
In the experiments presented in the two previous chapters, our robots were viewed as reactive systems, that is, as systems whose actions are completely determined by current sensory input. Although our results show that reactive systems can learn to carry out fairly complex tasks, there are interesting behavioral patterns, not determined by current perceptions alone, that they simply cannot learn. In one important class of nonreactive tasks, sequential behavior patterns, the decision of what action to perform at time t is influenced by the actions performed in the past.
This chapter presents an approach to the problem of learning sequential behavior patterns based on the coordination of separate, previously learned basic behaviors. Section 6.2 defines some technical terminology and situates the problem of sequential behavior in the context of a general approach to the development of autonomous agents. Section 6.3 defines the agents, the environment, and the behavior patterns on which we carried out our experimentation, and specifies our experimental plan. Section 6.4 describes the two different training strategies we used in our experiments, while section 6.5 reports some of our results.
(This chapter is based on the article "Training Agents to Perform Sequential Behavior" by M. Colombetti and M. Dorigo, which appeared in Adaptive Behavior 2 (3), 1994, © The MIT Press.)

6.2 REACTIVE AND DYNAMIC BEHAVIORS
As the simplest class of agents, reactive systems are agents that react to their current perceptions (Wilson 1990; Littman 1993). In a reactive
system, the action a(t) produced at time t is a function of the sensory input s(t) at time t:

a(t) = f(s(t)).

As argued by Whitehead and Lin (1995), reactive systems are perfectly adequate to Markov environments, more specifically, when
• the greedy control strategy is globally optimal, that is, choosing the locally optimal action in each environmental situation leads to a course of actions that is globally optimal; and
• the agent has knowledge about both the effects and the costs (gains) of each possible action in each possible environmental situation.
Although fairly complex behaviors can be carried out in Markov environments, very often an agent cannot be assumed to have complete knowledge about the effects or the costs of its own actions. Non-Markov situations are basically of two different types:
1. Hidden state environments. A hidden state is a part of the environmental situation not accessible to the agent but relevant to the effects or to the costs of actions. If the environment includes hidden states, a reactive agent cannot select an optimal action; for example, a reactive agent cannot choose an optimal movement to reach an object that it does not see.
2. Sequential behaviors. Suppose that at time t an agent has to choose an action as a function of the action performed at time t - 1. A reactive agent can perform an optimal choice only if the action performed at time t - 1 has some characteristic and observable effect at time t, that is, only if the agent can infer which action it performed at time t - 1 by inspecting the environment at time t. For example, suppose that the agent has to put an object in a given position, and then to remove the object. If the agent is able to perceive that the object is in the given position, it will be able to appropriately sequence the placing and the removing actions. On the other hand, a reactive agent will not be able to act properly at time t if (i) the effects of the action performed at time t - 1 cannot be perceived by the agent at time t (this is a subcase of the hidden state problem); or (ii) no effect of the action performed at time t - 1 persists in the environment at time t.
To develop an agent able to deal with non-Markov environments, one must go beyond the simple reactive model. We say that an agent is "dynamic" if the action a(t) it performs at time t depends not only on its
current sensory input s(t) but also on its state x(t) at time t; in turn, such state and the current sensory input determine the state at time t + 1:

a(t) = f(s(t), x(t));
x(t + 1) = g(s(t), x(t)).

In this way, the current action can depend on the past history.
An agent's states can be called "internal states," to distinguish them from the states of the environment. They are often regarded as memories of the agent's past or as representations of the environment. In spite (or because) of their rather intuitive meaning, however, terms like memory or representation can easily be used in a confusing way. Take the packing task proposed by Lin and Mitchell (1992, p. 1) as an example of a non-Markov problem:

Consider a packing task which involves 4 steps: open a box, put a gift into it, close it, and seal it. An agent driven only by its current visual percepts cannot accomplish this task, because when facing a closed box the agent does not know if a gift is already in the box and therefore cannot decide whether to seal or open the box.

It seems that the agent needs to remember that it has already put a gift into the box. In fact, the agent must be able to assume one of two distinct internal states, say 0 and 1, so that its controller can choose different actions when the agent is facing a closed box. We can associate state 0 with "The box is empty," and state 1 with "The gift is in the box." Clearly, the state must switch from 0 to 1 when the agent puts the gift into the box. But now, the agent's state can be regarded:
1. as a memory of the past action "Put the gift into the box"; or
2. as a representation of the hidden environmental state "The gift is in the box."
Probably, the choice of one of these views is a matter of personal taste. But consider a different problem. There are two distinct objects in the environment, say A and B. The agent has to reach A, touch it, then reach B, touch it, then reach A again, touch it, and so on. In this case, provided that touching an object does not leave any trace on it, there is no hidden state in the environment to distinguish the situations in which the agent should reach A from those in which it should reach B. We say that this environment is "forgetful" in that it does not keep track of the past actions of the agent. Again, the agent must be able to assume two distinct internal states, 0 and 1, so that its task is to reach A when in state 0, and
to reach B when in state 1. Such internal states cannot be viewed as representations of hidden states, because there are no such states in this environment. However, we still have two possible interpretations:
1. Internal states are memories of past actions: 0 means that B has just been touched, and 1 means that A has just been touched; or
2. Internal states are goals, determining the agent's current task: state 0 means that the current task is to reach A, state 1 means that the current task is to reach B.
The conclusion we draw is that terms like memory, representation, and goal, which are very commonly used for example in artificial intelligence, often involve a subjective interpretation of what is going on in an artificial agent. The term internal state, borrowed from systems theory, seems to be neutral in this respect, and it describes more faithfully what is actually going on in the agent. In this chapter, we will be concerned with internal states that keep track of past actions, so that the agent's behavior can follow a sequential pattern. In particular, we are interested in dynamic agents possessing internal states by design, and learning to use them to produce sequential behavior. Then, if we interpret internal states as goals, this amounts to learning an action plan, able to enforce the correct sequencing of actions, although this intuitive idea must be taken with some care.
Let us observe that not all behavior patterns that at first sight appear to be based on an action plan are necessarily dynamic. Consider an example of hoarding behavior: an agent leaves its nest, chases and grasps a prey, brings it to its nest, goes out for a new prey, and so on. This sequential behavior can be produced by a reactive system, whose stimulus-response associations are described by the following production rules (where only the most specific production whose conditions are satisfied is assumed to fire at each cycle):
Rule 1: → move randomly.
Rule 2: not grasped & prey ahead → move ahead.
Rule 3: not grasped & prey at contact → grasp.
Rule 4: grasped & nest ahead → move ahead.
Rule 5: grasped & in nest → drop.
In fact, the experiments reported in chapters 4 and 5 show that a reactive agent implemented with ALECSYS can easily learn to perform tasks similar to the one above.
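Because every condition in the five rules refers only to current percepts, the whole controller can be written as a stateless function. The sketch below is our own rendering (with made-up names), with the most specific satisfied rule firing first; note that no internal state appears anywhere, so the apparent sequencing is carried entirely by the environment.

def hoarding_controller(grasped, prey_ahead, prey_at_contact, nest_ahead, in_nest):
    # Rules are tried from most to least specific; the first match fires.
    if grasped and in_nest:
        return "drop"              # Rule 5
    if grasped and nest_ahead:
        return "move ahead"        # Rule 4
    if not grasped and prey_at_contact:
        return "grasp"             # Rule 3
    if not grasped and prey_ahead:
        return "move ahead"        # Rule 2
    return "move randomly"         # Rule 1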
It is interesting to see why the behavior pattern described above, while merely reactive, appears as sequential to an external observer. If, instead of the agent's behavior, we consider the behavior of the global dynamic system constituted by the agent and the environment, the task is actually dynamic. The relevant states are the states of the environment, which keeps track of the effects of the agent's moves; for example, the effects of a grasping action are stored by the environment in the form of a grasped prey, which can then be perceived by the agent. We call pseudosequences tasks performed by a reactive agent that are sequential in virtue of the dynamic nature of the environment, and use the term proper sequences for tasks that can be executed only by dynamic agents, in virtue of their internal states.
Let us look at these considerations from another perspective. As has already been suggested in the literature (see, for example, Rosenschein and Kaelbling 1986; Beer 1995), the agent and the environment can be viewed as a global system, made up of two coupled subsystems. For the interactions of the two subsystems to be sequential, at least one of them must be properly dynamic in the sense that its actions depend on the subsystem's state. The two subsystems are not equivalent, however. Even though the agent can be shaped to produce a given target behavior, the dynamics of the environment are taken as given and cannot be trained. It is therefore interesting to see whether the only subsystem that can be trained, namely, the agent, can contribute to a sequential interaction with states of its own: this is what we have called a "proper sequence."
In this chapter, instead of experimenting with sequences of single actions, we have focused on tasks made up of a sequence of phases, where a phase is a subtask that may involve an arbitrary number of single actions. Again, the problem to be solved is, how can we train an agent to switch from the current phase to the next one on the basis of both current sensory input and knowledge of the current phase?
One important thing to be decided is when the phase transition should occur. The most obvious assumption is that a transition signal is produced by the trainer or by the environment, and is perceived by the agent. Clearly, if we want to experiment on the agent's capacity to produce proper behavioral sequences, the transition signal must not itself convey information about which should be the next phase.
6.3 EXPERIMENTAL SETTINGS
This section presents the environments, the agent, and the target behavior for our experiments.
Figure 6.1 Forgetful environment for sequential behavior
6.3.1 The Simulation Environment
What is a good experimental setting to show that proper sequences can emerge? Clearly, one in which (1) agent-environment interactions are sequential, and (2) the sequential nature of the interactions is not due to states of the environment. Indeed, under these conditions we have the guarantee that the relevant states are those of the agent. Therefore, we carried out our initial experiments on sequential behavior in forgetful environments, that is, environments that keep no track of the effects of the agent's moves. Our experimental environment is basically an empty space containing two objects, A and B (figure 6.1). The objects lie on a bidimensional plane in which the agent can move freely; the distance between them is approximately 100 forward steps of the agent. In some of the experiments, both objects emit a signal when the agent enters a circular area of predefined radius around the object (shown by the dashed circles in the figure).

6.3.2 The Agent's "Body"
The AutonoMouse used in these experiments is very similar to the simulated AutonoMice used for the experiments in chapter 4. Its sensors are two on/off eyes with a limited visual field of 180 degrees and an on/off microphone. The eyes are able to detect the presence of an object in their visual fields, and can discriminate between the two objects A and B. The visual fields of
the two eyes overlap by 90 degrees, so that the total angle covered by the two eyes is 270 degrees, partitioned into three areas of 90 degrees each (see figure 6.1). The AutonoMouse's effectors are two independent wheels that can either stay still, move forward one or two steps, or move backward one step.
6.3.3 Target Behavior
The target behavior is the following: the AutonoMouse has to approach object A, then approach object B, then approach object A, and so on. This target sequence can be represented by the regular expression {αβ}*, where α denotes the behavioral phase in which the AutonoMouse has to approach object A, and β the behavioral phase in which the AutonoMouse has to approach object B. We assume that the transition from one phase to the next occurs when the AutonoMouse senses a transition signal. This signal tells the AutonoMouse that it is time to switch to the next phase, but does not tell it which phase should be the next. Concerning the production of the transition signal, there are basically two possibilities:
1. External-based transition, where the transition signal is produced by an external source (e.g., the trainer) independently of the current interaction between the agent and its environment; or
2. Result-based transition, where the transition signal is produced when a given situation occurs as a result of the agent's behavior; for example, a transition signal is generated when the AutonoMouse has come close enough to an object.
The choice between these two variants corresponds to two different intuitive conceptions of the overall task. If we choose external-based transitions, what we actually want from the AutonoMouse is that it learn to switch phase each time we tell it to. If, instead, we choose result-based transitions, we want the AutonoMouse to achieve a given result, and then to switch to the next phase. Suppose that the transition signal is generated when the agent reaches a given threshold distance from A or from B. This means that we want the agent to reach object A, then to reach object B, and so on. As we shall see, the different conceptions of the task underlying this choice influence the way in which the AutonoMouse can be trained.
It would be easy to turn the environment described above into a Markov environment so that a reactive agent could learn the target behavior. For example, we could assume that A and B are two lights, alternately
switched on and off, exactly one light being on at each moment. In this case, a reactive AutonoMouse could learn to approach the only visible light, and a pseudosequential behavior would emerge as an effect of the dynamic nature of the environment.
6.3.4 The Agent's Controller and Sensorimotor Interfaces
For the {αβ}* behavior, we used a two-level hierarchical architecture (see chapter 3) that was organized as follows. The basic modules consisted of two independent LCSs, LCSα and LCSβ, in charge of learning the two basic behaviors α and β. The coordinator consisted of one LCS, in charge of learning the sequential coordination of the lower-level modules.
The input of each basic module represents the relative direction in which the relevant object is perceived. Given that the AutonoMouse's eyes partition the environment into four angular areas, both modules have a two-bit sensory word as input. At any cycle, each basic module proposes a motor action, which is represented by four bits coding the movement of each independent wheel (two bits code the four possible movements of the left wheel, and two bits those of the right wheel).
Coordination is achieved by choosing for execution exactly one of the actions proposed by the lower-level modules. This choice is based on the value of a one-bit word, which represents the internal state of the agent, and which we therefore call the "state word." The effect of the state word is hardwired: when its value is 0, the action proposed by LCSα is executed; when the value is 1, it is LCSβ that wins. The coordinator receives as input the current value of the state word and one bit representing the state of the transition signal sensor; this bit is set to 1 at the rising edge of the transition signal and to 0 otherwise. The possible actions for the coordinator are (1) set the state word to 0, and (2) set the state word to 1. The task that the coordinator has to learn is "Maintain the same phase" if no transition signal is perceived, and "Switch phase" each time a transition signal is perceived.
The controller architecture, described in figure 6.2, is a dynamic system. At each cycle t, the sensory input at t and the value of the state word at t jointly determine both the action performed at t and the value of the state word at t + 1.
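One control cycle of this architecture can be summarized as in the sketch below. This is only a schematic reading of figure 6.2, not the ALECSYS implementation; the two basic modules and the coordinator are abstracted as functions, and all names are ours.

def control_cycle(state_word, transition_signal, input_alpha, input_beta,
                  lcs_alpha, lcs_beta, coordinator):
    # Each basic module proposes a four-bit motor action from its two-bit input.
    action_alpha = lcs_alpha(input_alpha)
    action_beta = lcs_beta(input_beta)
    # Hardwired effect of the state word: 0 executes LCS-alpha's proposal,
    # 1 executes LCS-beta's proposal.
    executed_action = action_alpha if state_word == 0 else action_beta
    # The coordinator sees only the state word and the transition-signal bit,
    # and outputs the state word for the next cycle (maintain or switch phase).
    next_state_word = coordinator(state_word, transition_signal)
    return executed_action, next_state_word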
6.3.5 Experimental Methodology
For each experiment reported in this chapter we ran twelve independent trials, starting from random initial conditions.
Figure 6.2 Controller architecture of the AutonoMouse

Each trial included
• a basic learning session of 4,000 cycles, in which the two basic behaviors α and β were learned;
• a coordinator learning session of 12,000 cycles, in which learning of the basic behaviors was switched off and only the coordinator was allowed to learn; and
• a test session of 4,000 cycles, where all learning was switched off and the performance of the agent was evaluated.
In the learning sessions, the agent's performance Plearn(t) at cycle t was computed for each trial as

Plearn(t) = (number of correct actions performed from cycle 1 to cycle t) / t,
where an action was considered as correct if it was positively reinforced. We called the graph of Plearn(t) for a single trial a "learning curve." In the test session, the agent's performance Ptest was measured for each trial as a single number:

Ptest = (number of correct actions performed in the test session) / 4,000.

For each experiment we show the coordinator learning curve of a single, typical trial, and report the mean and standard deviation of the twelve Ptest values for (1) the two basic behaviors α and β; and (2) the two coordinator's tasks ("maintain" and "switch"). It is important to remark that the performance of the coordinator was judged from the
overall behavior of the agent. That is, the only information available to evaluate such performance was whether the agent was actually approaching A or approaching B; no direct access to the coordinator's state word was allowed. Instead, to evaluate the performance of the basic behaviors, it was also necessary to know at each cycle whether the action performed was suggested by LCSα or by LCSβ; this fact was established by directly inspecting the internal state of the agent.
Finally, to establish whether different experiments resulted in significantly different performances, we computed the probability, p, that the samples of performances produced by the different experiments were drawn from the same population. To compute p, we used the Mann-Whitney test (see section 4.1.1).
6.4 TRAINING POLICIES
As we said in chapter 1, we view training by reinforcement learning as a mechanism to translate a specification of the agent's target behavior, embodied in the reinforcement policy, into a control program that realizes the behavior. As the learning mechanism carries out the translation in the context of agent-environment interactions, the resulting control program can be highly sensitive to features of the environment that would be difficult to model explicitly in a handwritten control program.
As usual, our trainer was implemented as an RP, which embodied the specification of the target behavior. We believe that an important quality an RP should possess is to be highly agent-independent. In other words, we want the RP to base its judgments on high-level features of the agent's behavior, without bothering too much about the details of such behavior. In particular, we want the RP to be as independent as possible from internal features of the agent, which are unobservable to an external observer.
Let us consider the {αβ}* behavior, where
α = approach object A;
β = approach object B.
The transitions from α to β and from β to α should occur whenever a transition signal is perceived.
The first step is to train the AutonoMouse to perform the two basic behaviors α and β. This is a fairly easy task, given that the basic behaviors are instances of approaching responses that can be produced by a simple
reactive agent. After the basic behaviors have been learned, the next step is to train the AutonoMouse's coordinator to generate the target sequence. Before doing so, we have to decide how the transition signal is to be generated. We experimented with both external-based and result-based transitions.
6.4.1 External-Based Transitions

External-based transition signals are generated by the trainer. Let us assume that coordinator training starts with phase α. The trainer rewards the AutonoMouse if it approaches object A, and punishes it otherwise. At random intervals, the trainer generates a transition signal, which lasts one cycle. After the first transition signal is generated, the AutonoMouse is rewarded if it approaches object B, and punished otherwise, and so on. We can represent this strategy as a set of rules, shown in table 6.1. Columns 2 and 5 report the "correct" phase (i.e., the phase from the trainer's point of view) at times t-1 and t. Clearly, the correct phase at t is the same as the correct phase at t-1 when the transition signal at t-1 (reported in column 4) is OFF, and it is the opposite phase when the transition signal at t-1 is ON. Columns 3 and 6 report the phase from the point of view of the AutonoMouse. Note that the reinforcement given at t (column 7) depends only on a comparison between the trainer's and the AutonoMouse's points of view at time t, independently of the transition signal at t-1. In other words, reinforcements can be given according to the much simpler table 6.2. However, this strategy is going to lead us into trouble. To understand why, remember that the AutonoMouse cannot read the trainer's mind, and that it can only try to correlate the reinforcement received with its own behavior and perception of the transition signal. If we extract from table 6.1 only the information available to the AutonoMouse, we obtain table 6.3. Unfortunately, this table is incoherent: the upper and the lower halves of the table are opposite with respect to the reinforcements. To give a concrete example, let us suppose that in phase α the AutonoMouse is correctly approaching A, but does not change behavior in the presence of a transition signal (i.e., it makes an error). Given that the AutonoMouse continues to approach A while it should now approach B, the trainer will give a punishment, applying rule 3 of table 6.1. But then, suppose that the AutonoMouse goes on approaching A. What should the trainer do? A rigid reinforcement program (RPrig) would continue to punish the AutonoMouse, applying rule 9 of table 6.1.
Table 6.1
Rigid RP: Trainer's strategy for reinforcing the AutonoMouse's behavior at time t

Rule no.  Phase at t-1  Phase at t-1   Transition     Phase at t  Phase at t     Reinforcement
          (trainer)     (AutonoMouse)  signal at t-1  (trainer)   (AutonoMouse)  at t
 1        α             α              OFF            α           α              +
 2        α             α              OFF            α           β              -
 3        α             α              ON             β           α              -
 4        α             α              ON             β           β              +
 5        α             β              OFF            α           α              +
 6        α             β              OFF            α           β              -
 7        α             β              ON             β           α              -
 8        α             β              ON             β           β              +
 9        β             α              OFF            β           α              -
10        β             α              OFF            β           β              +
11        β             α              ON             α           α              +
12        β             α              ON             α           β              -
13        β             β              OFF            β           α              -
14        β             β              OFF            β           β              +
15        β             β              ON             α           α              +
16        β             β              ON             α           β              -
Table 6.2
Rigid RP: Trainer's strategy actually depends only on the AutonoMouse's present behavior

Phase at t  Phase at t     Reinforcement
(trainer)   (AutonoMouse)  at t
α           α              +
α           β              -
β           α              -
β           β              +
Table 6.3
Rigid RP: From the AutonoMouse's point of view, the trainer's strategy appears incoherent

Rule no.  Phase at t-1   Transition     Phase at t     Reinforcement
          (AutonoMouse)  signal at t-1  (AutonoMouse)  at t
 1        α              OFF            α              +
 2        α              OFF            β              -
 3        α              ON             α              -
 4        α              ON             β              +
 5        β              OFF            α              +
 6        β              OFF            β              -
 7        β              ON             α              -
 8        β              ON             β              +
 9        α              OFF            α              -
10        α              OFF            β              +
11        α              ON             α              +
12        α              ON             β              -
13        β              OFF            α              -
14        β              OFF            β              +
15        β              ON             α              +
16        β              ON             β              -
Unfortunately, such a training strategy is seen as incoherent by the AutonoMouse, which is simply maintaining the same behavior in the absence of a transition signal, a conduct that is normally rewarded. Under these conditions, it would be impossible for the AutonoMouse to learn any meaningful policy. This inability has been observed experimentally. We have therefore applied what we call a "flexible reinforcement program" (RPflex): after the AutonoMouse makes an error, the trainer changes phase. It is the trainer who adapts to the AutonoMouse's behavior, applying a flexible training method. In this way there will never be incoherence in the trainer's behavior: in the previous case, after the AutonoMouse has applied the incorrect rule 3 of table 6.1, the trainer goes back to phase α. If the AutonoMouse now goes on approaching A, then the situation described by rule 1 of table 6.1 will apply, which
now results in a positive reinforcement. In other words, the trainer's conduct with RPflex can be described as follows:
• Start in phase α;
• When in phase α, reward the AutonoMouse if it approaches A, and punish it otherwise; when in phase β, reward the AutonoMouse if it approaches B, and punish it otherwise;
• Change phase when a transition signal is produced;
• If the AutonoMouse appears not to change behavior in the presence of a transition signal, punish it and restore the previous phase; and
• If the AutonoMouse appears to change behavior in the absence of a transition signal, punish it and change phase.
The rationale of this reinforcement program is that the trainer punishes an inadequate treatment of the transition signal, but rewards coherency of behavior. Experiments 1, 2, and 3 (next section) have been run using RPflex.
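As an illustration only, the following sketch (ours, with assumed encodings for phases and reinforcements; it is not the RP actually run with ALECSYS) implements the trainer logic just described.

```python
# Sketch of the flexible reinforcement program RPflex described above.
# "A" and "B" stand for the two phases (approach A / approach B); the
# agent's apparent phase is inferred from which object it is approaching.
class FlexibleRP:
    def __init__(self):
        self.phase = "A"  # start in phase alpha (approach A)

    def reinforce(self, transition_signal, apparent_phase):
        """Return +1 (reward) or -1 (punishment) and update the trainer's phase."""
        if transition_signal:
            # The transition signal asks the agent to switch phase.
            if apparent_phase == self.phase:
                # Agent did not switch: punish it and keep (restore) the
                # previous phase, so its coherent conduct is not judged twice.
                return -1
            self.phase = apparent_phase
            return +1
        # No transition signal: the agent should keep its current phase.
        if apparent_phase == self.phase:
            return +1
        # Agent switched without a signal: punish it and adapt to its phase.
        self.phase = apparent_phase
        return -1
```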
6.4.2 Result-Based Transitions
Let us now suppose that the target sequential behavior is understood as follows: the agent should approach and reach object A, then approach and reach object B, and so on. A major difference with respect to the previous case is that a transition signal is now generated each time the agent comes close enough to an object (see the dashed circles in figure 6.1). If we apply the rigid reinforcement program, the AutonoMouse does not learn the target behavior, for the reasons already pointed out in section 6.4.1. But in this case, it no longer makes sense for the trainer to flexibly change phase when the agent switches behavior: a phase is completed only when a given result is achieved, that is, when the relevant object is reached. It is easy to see that incoherent situations can be eliminated by adding an extra sensor to the AutonoMouse that records information about the sign of the reinforcement received. This extra sensor, which we call a "reinforcement sensor," is a one-bit field in the sensory interface that tells the AutonoMouse whether the previous action had been rewarded or punished. Table 6.4 shows the rationale of RPrig, including the reinforcement given at t-1. Table 6.5 reports the information available to the AutonoMouse in the presence of a reinforcement sensor, assuming this is set to 1 when an action is rewarded, and to 0 when an action is punished. Contrary to table 6.3, table 6.5 is no longer incoherent, in that the corresponding lines of its upper and lower halves are now distinguished by the state of the reinforcement sensor.
Table 6.4
Rigid RP with result-based transitions: trainer's strategy, including the reinforcement given at t-1

Rule no.  Phase at t-1  Phase at t-1   Reinforcement  Transition     Phase at t  Phase at t     Reinforcement
          (trainer)     (AutonoMouse)  at t-1         signal at t-1  (trainer)   (AutonoMouse)  at t
 1        α             α              +              OFF            α           α              +
 2        α             α              +              OFF            α           β              -
 3        α             α              +              ON             β           α              -
 4        α             α              +              ON             β           β              +
 5        α             β              -              OFF            α           α              +
 6        α             β              -              OFF            α           β              -
 7        α             β              -              ON             β           α              -
 8        α             β              -              ON             β           β              +
 9        β             α              -              OFF            β           α              -
10        β             α              -              OFF            β           β              +
11        β             α              -              ON             α           α              +
12        β             α              -              ON             α           β              -
13        β             β              +              OFF            β           α              -
14        β             β              +              OFF            β           β              +
15        β             β              +              ON             α           α              +
16        β             β              +              ON             α           β              -
Table 6.5
Information available to the AutonoMouse in the presence of the reinforcement sensor

Rule no.  Phase at t-1   Reinforcement   Transition     Phase at t     Reinforcement
          (AutonoMouse)  sensor at t-1   signal at t-1  (AutonoMouse)  at t
 1        α              1               OFF            α              +
 2        α              1               OFF            β              -
 3        α              1               ON             α              -
 4        α              1               ON             β              +
 5        β              0               OFF            α              +
 6        β              0               OFF            β              -
 7        β              0               ON             α              -
 8        β              0               ON             β              +
 9        α              0               OFF            α              -
10        α              0               OFF            β              +
11        α              0               ON             α              +
12        α              0               ON             β              -
13        β              1               OFF            α              -
14        β              1               OFF            β              +
15        β              1               ON             α              +
16        β              1               ON             β              -
Experiments 4, 5, and 6 (next section) show that the AutonoMouse is able to learn the target behavior when trained with RPrig if its sensory interface includes the reinforcement sensor. We think that the notion of the reinforcement sensor is not trivial and therefore needs to be discussed in some detail.

6.4.3 Meaning and Use of the Reinforcement Sensor

At each moment, the reinforcement sensor stores information about what happened in the previous cycle; as such, it contributes to the agent's dynamic behavior. Its characteristic feature is that it stores information about the behavior of the trainer, not of the physical environment. It may seem that such information is available to the AutonoMouse even without
the reinforcement sensor, because it is received and processed by the credit apportionment module of ALECSYS. The point is that no information about reinforcement is available to the AutonoMouse's controller unless it is coded into the sensory interface. In a sense, an agent endowed with the reinforcement sensor not only receives reinforcements but also perceives them. It is interesting to see how the information stored by the reinforcement sensor is exploited by the learning process. When trained with RPrig, the AutonoMouse will develop behavior rules that, when the reinforcement sensor is set to 0, change phase in the absence of a transition signal (rules 5 and 10 of tables 6.4 and 6.5), and do not change phase in the presence of a transition signal (rules 8 and 11 of tables 6.4 and 6.5). Such rules can be viewed as error recovery rules, in that they tell the agent what to do in order to fix a previous error in phase sequencing. Rules matching messages with the reinforcement sensor set to 1 will be called "normal" rules, to distinguish them from error recovery rules. Without a reinforcement sensor, punishments are exploited by the system only to decrease the strength of a rule that leads to an error (i.e., to an incorrect action). With the reinforcement sensor, punishments are used for one extra purpose: to enable error recovery rules at the next cycle. In general, as learning proceeds, fewer and fewer errors are made by the AutonoMouse, and the error recovery rules become increasingly weaker, so that sooner or later they are removed by the genetic algorithm. Error recovery rules presuppose a reinforcement, and thus can be used only as long as the trainer is on. If the trainer is switched off to test the acquired behavior, the reinforcement sensor must be clamped to 1 so that normal rules can be activated. This means that error recovery rules will remain silent after the trainer is switched off; it is therefore advisable to do so only after all recovery rules have been eliminated by the genetic algorithm. In the experiments reported in this chapter, error recovery rules were either eliminated before we switched off the trainer, or they became so weak that they were practically no longer activated. With more complex tasks, however, one can easily imagine situations where some error recovery rules could maintain a strength high enough to survive and to contribute to the final behavior, and where switching off the trainer would actually impoverish the final performance. On the other hand, one could still switch off the learning algorithm: the use of error recovery rules presupposes that an external system gives positive or negative "judgments"
about the AutonoMouse's actions, but does not require the learning algorithm to be active. After switching off learning, the trainer actually turns into an advisor, that is, an external observer in charge of telling the agent, which is no longer learning anything, whether it is doing well or not. We still do not know whether the use of an advisor has interesting practical applications; it seems to us that it could be useful in situations where the environment is so unpredictable that even the application of the most reasonable control strategy would frequently lead to errors. In such situations, it would not be possible to avoid errors through further learning; error recovery would therefore seem to be an appealing alternative.
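To make the mechanism concrete, here is a small sketch (ours; the message layout and field names are assumptions, not the ALECSYS classifier format) of how a one-bit reinforcement sensor can be appended to the sensory message, and of the difference between a normal rule and an error recovery rule.

```python
# Sketch: the reinforcement sensor as an extra bit of the sensory interface.
def encode_message(sees_light_left, sees_light_right, transition_signal,
                   last_action_rewarded):
    return (
        int(sees_light_left),
        int(sees_light_right),
        int(transition_signal),
        int(last_action_rewarded),   # reinforcement sensor: 1 = rewarded
    )

# A "normal" rule matches messages whose reinforcement sensor is 1; an
# "error recovery" rule matches messages whose sensor is 0 and tells the
# agent how to fix the phase-sequencing error made at the previous cycle.
normal_rule         = {"condition": (None, None, 0, 1), "action": "keep_phase"}
error_recovery_rule = {"condition": (None, None, 0, 0), "action": "switch_phase"}

def matches(rule, message):
    # None acts as a wildcard on the corresponding message bit.
    return all(c is None or c == m for c, m in zip(rule["condition"], message))
```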
6.5 EXPERIMENTAL RESULTS

This section reports the results of the experiments on the {αβ}* behavior. The experiments described are the following:
• Experiments 1-3: sequential behavior with external-based transitions and flexible reinforcement program.
• Experiments 4-6: sequential behavior with result-based transitions, rigid reinforcement program, and reinforcement sensor.
The performances of the coordinator tasks ("maintain" and "switch")
are then compared by the Mann-Whitney test.
6.5.1 Experiment 1: External-Based Transitions and Flexible Reinforcement Program

In this experiment, transition signals are produced randomly, with an average of one signal every 50 cycles. The AutonoMouse is trained with the flexible reinforcement program, RPflex. Figure 6.3 shows a typical learning curve for the coordinator learning session and reports the mean and standard deviation of the performances obtained in the test session of twelve trials. It appears that the AutonoMouse learns to maintain the current phase (in the absence of a transition signal) better than to switch phase (when it perceives a transition signal). This result is easy to interpret: as transition signals are relatively rare, the AutonoMouse learns to maintain the current phase faster than it learns to switch phase. As a whole, however, the performance of the coordinator is not fully satisfactory, at least as far as the switch task is concerned. One factor that
[Figure 6.3: Experiment 1, learning sequential behavior with external-based transitions. Coordinator learning curve (maintain and switch) over 12,000 cycles.]

Performance in test session: basic behaviors
            Task α    Task β
Mean        0.9155    0.8850
Std. dev.   0.0656    0.1192

Performance in test session: coordination
            Maintain  Switch
Mean        0.9337    0.8276
Std. dev.   0.0471    0.0872
keeps the performance of the coordinator well below 1 is that the performances of the two basic behaviors are not close enough to 1. Indeed, during the training of the coordinator, an action may be punished even if the coordinator has acted correctly, if a wrong move is proposed by the relevant basic LCS. There is, however, another reason why the learning of the coordination tasks is not satisfactory: RPflex cannot teach perfect coordination because there are ambiguous situations, that is, situations where it is not clear whether the reinforcement program should reward or punish the agent. Suppose, for example, that the AutonoMouse perceives a transition signal at cycle t when it is approaching A on a curvilinear trajectory like the one shown in figure 6.4, and that at cycle t + 1 it keeps on following the same trajectory. By observing this behavior, one cannot tell whether the AutonoMouse decided to keep on approaching A or whether it changed phase and is now turning to approach object B. Because the agent's behavior is ambiguous, any reinforcement actually runs the risk of saying the opposite of what it is intended to. Ambiguous situations of the type described earlier arise because the agent's internal state is hidden from the point of view of the trainer. One possible solution is to make the relevant part of this state known to RPflex. This was implemented in the next experiment.
[Figure 6.4: Ambiguous situation.]

6.5.2 Experiment 2: External-Based Transitions, Flexible Reinforcement Program, and Agent-to-Trainer Communication

To eliminate ambiguous situations, we simulate a communication process from the agent to the trainer: better reinforcements can be generated if the agent communicates its state to the reinforcement program, because situations like the one described earlier are no longer ambiguous. To achieve this result, we added to the AutonoMouse the ability to assume two externally observable states, which we conventionally call "colors." The AutonoMouse can be either "white" or "black," and can assume either color as the result of a specific "chromatic" action. The trainer can observe the AutonoMouse's color at any time. The basic modules are now able to perform one more action at each cycle, that is, to set a color bit to 0 (white) or to 1 (black); for this purpose, an extra bit is added to the action part of classifiers. In the basic learning session, the AutonoMouse is trained not only to perform the approaching behaviors α and β but also to associate a single color to each of them. During the coordinator learning session, RPflex exploits information about the color to disambiguate the AutonoMouse's internal state. The results of this experiment are reported in figure 6.5.

[Figure 6.5: Experiment 2, external-based transitions and agent-to-trainer communication. Coordinator learning curve and test-session performance tables.]

Performance in test session: basic behaviors
            Task α    Task β
Mean        0.8893    0.9442
Std. dev.   0.0650    0.0509

6.5.3 Experiment 3: External-Based Transitions, Flexible Reinforcement Program, and Transfer of the Coordinator

Another interesting solution to the problem of ambiguous situations is based on the notion of transferring behavioral coordination. The idea is that the {αβ}* behavior is based on two components: the ability to perform the basic behaviors α and β, and the ability to coordinate them in order to achieve the required sequence. While the basic behaviors are strongly linked to the environment, coordination is abstract enough to be learned in one environment, and then transferred to another. Therefore, we proceed as follows:
1. The AutonoMouse learns the complete {αβ}* behavior in a simpler environment, where the ambiguity problem does not arise;
2. The AutonoMouse is then trained to perform the two basic behaviors in the target environment; and
3. Finally, the coordinator rules learned in the simpler environment are copied into the coordinator for the target task. For this purpose, we select the rules that have the highest performance in the simpler environment; therefore, all twelve experiments in the target environment are run with the same coordinator.
The simpler, unambiguous environment used for coordination training is sketched in figure 6.6. It is a one-dimensional counterpart of the target environment: the AutonoMouse can only move to the left or to the right on a fixed rail. At each instant, the AutonoMouse is either approaching A or approaching B: no ambiguous situations arise.
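The transfer step of point 3 can be pictured with the following sketch (ours; the rule-set and trial formats are assumptions): the coordinator rule set with the best test performance in the simpler environment is selected and copied, unchanged, into the coordinator used for the target task.

```python
# Sketch of transferring the learned coordinator between environments.
# 'trials' maps a trial id to (coordinator_rule_set, test_performance)
# obtained in the simpler 1-D environment.
def select_best_coordinator(trials):
    best_id = max(trials, key=lambda t: trials[t][1])
    return trials[best_id][0]

def build_target_agent(coordinator_rules, train_basic_behavior):
    # The basic behaviors are learned directly in the target (2-D)
    # environment; the coordinator is not retrained, it is copied over.
    return {
        "alpha": train_basic_behavior("approach_A"),
        "beta": train_basic_behavior("approach_B"),
        "coordinator": coordinator_rules,
    }
```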
[Figure 6.6: The 1-D environment.]
As reported in figure 6.7, due to the simplicity of the task, the performance achieved in the 1-D environment is almost perfect. Figure 6.8 shows the results obtained by transferring the coordinator to an AutonoMouse that had previously learned the basic behaviors in the 2-D environment. The reason why the performance of the coordination task is higher than in the case of agent-to-trainer communication is that with agent-to-trainer communication the significance of the communication bit had to be learned.

6.5.4 Comparison of Experiments 1-3

In experiments 1-3, three pairwise comparisons are possible, only two of which are statistically independent. We decided to use experiment 1 as a control and to compare the other two experiments with it. The Mann-Whitney test tells us that
• performance of the maintain task in experiment 2 is not significantly different (p = 0.4529, adjusted for ties) from that in experiment 1;
• performance of the maintain task in experiment 3 is higher than in experiment 1, but the statistical significance of the difference is low (p = 0.1059, adjusted for ties);
• performance of the switch task in experiment 2 is higher than in experiment 1, but the statistical significance of the difference is low (p = 0.0996, adjusted for ties);
• performance of the switch task in experiment 3 is higher than in experiment 1, with high statistical significance (p = 0.0018, adjusted for ties).

[Figure 6.7: Experiment 3, external-based transitions in the 1-D environment. Performance in test session, basic behaviors: Task α mean 0.9999 (std. dev. 0.0003), Task β mean 0.9999 (std. dev. 0.0002).]

[Figure 6.8: Experiment 3, transferring the external-based coordinator from the 1-D to the 2-D environment.]

Taking into account the mean performances of the switch task in the three experiments (see figures 6.3, 6.5, and 6.8), we conclude that, with external-based transitions,
1. The maintain task is learned reasonably well with all the strategies tested; transferring the coordinator appears to be the best approach, even if the difference with direct learning (experiment 1) cannot be considered statistically significant; and
2. For the switch task, transfer of the coordinator clearly appears to be the best approach, although agent-to-trainer communication also seems to overcome the problem of learning in ambiguous situations, albeit with low statistical significance.

6.5.5 Experiment 4: Result-Based Transitions with Reinforcement Sensor and Rigid Reinforcement Program
This and the next two experiments are run with result-based transitions, that is, a transition signal is generated each time the AutonoMouse reaches an object. The target behavior is therefore conceived as: reach object A, then reach object B, and so on. Consistent with this view of the target behavior, we adopt the rigid reinforcement program RPrig and give the AutonoMouse a one-bit reinforcement sensor. The results, reported in figure 6.9, show that the target behavior is learned, although the performance of the switch task is rather poor in comparison with experiment 1, possibly because the presence of new input information (the reinforcement sensor) makes the task more complex.
6.5.6 Experiment 5: Result-Based Transitions, Rigid Reinforcement Program with Reinforcement Sensor, and Agent-to-Trainer Communication

This experiment is the result-based analogue of experiment 2: the AutonoMouse is trained to assume a color, thus revealing its internal state. The results are reported in figure 6.10. As was the case for experiment 2, the coordinator performance increases noticeably.

6.5.7 Experiment 6: Result-Based Transitions, Rigid Reinforcement Program, and Transfer of the Coordinator

This experiment is the result-based analogue of experiment 3. Figure 6.11 shows the results obtained in the 1-D environment, and figure 6.12 gives the performances of the AutonoMouse in the 2-D environment, after transferring the best coordinator obtained in the 1-D environment.
[Figure 6.9: Experiment 4, learning sequential behavior with result-based transitions.]
Performance in test session, basic behaviors: Task α mean 0.9451 (std. dev. 0.0426), Task β mean 0.9743 (std. dev. 0.0400).
Performance in test session, coordination: maintain mean 0.9313 (std. dev. 0.0528), switch mean 0.7166 (std. dev. 0.1559).

[Figure 6.10: Experiment 5, result-based transitions and agent-to-trainer communication.]
Performance in test session, basic behaviors: Task α mean 0.9838 (std. dev. 0.0201), Task β mean 0.9879 (std. dev. 0.0144).

[Figure 6.11: Experiment 6, result-based transitions in the 1-D environment.]

[Figure 6.12: Experiment 6, transferring the result-based coordinator from the 1-D to the 2-D environment.]
Performance in test session, basic behaviors: Task α mean 0.9935 (std. dev. 0.0067), Task β mean 0.9895 (std. dev. 0.0112).
[Figure 6.13: Results of the Mann-Whitney tests for the "maintain" (above) and "switch" (below) tasks. Arrows indicate the direction of increasing performance (e.g., the performance of the maintain task in experiment 4 is lower than in both experiment 5 and experiment 6).]

As with external-based transitions, the performance of the coordination task is higher in the case of transfer of the coordinator than in the case of agent-to-trainer communication. Again, this is because the significance of the communication bit has to be learned.

6.5.8 Comparison of Experiments 4-6

In experiments 4-6, three pairwise comparisons are possible, two of which are independent. We decided to use experiment 4 as a control and to compare the other two experiments with it. In both cases, the Mann-Whitney test tells us that there are highly significant differences:
• performance in experiment 5 is higher than in experiment 4, with high statistical significance for both the maintain task (p = 0.0012) and the switch task (p = 0.0032);
• performance in experiment 6 is higher than in experiment 4, with high statistical significance for both the maintain task (p = 0.0001) and the switch task (p = 0.0002).
Taking into account the mean performances of the two tasks in these experiments (see figures 6.9, 6.10, and 6.12), we conclude that, with result-based transitions, coordinator transfer and agent-to-trainer communication significantly overcome the problem of learning in ambiguous situations, both for the maintain and for the switch task. Figure 6.13
summarizes the results of the Mann-Whitney tests for all relevant pairs of this set of experiments.
6.6 POINTS TO REMEMBER

• In the context of artificial systems, the term internal state can be profitably used to replace terms like memory, representation, and goal.
• Pseudosequences are those tasks performed by a reactive agent that are sequential in virtue of the dynamic nature of the environment.
• Proper sequences are sequential tasks performed by a dynamic agent, that is, an agent endowed with internal state.
• The best policy to train an agent to perform a proper sequence depends on the way in which transition signals are generated: external-based transition signals call for a flexible reinforcement program; result-based transition signals call for a rigid reinforcement program.
• To improve the performance of systems learning under the guidance of a rigid reinforcement program, it is useful to enrich the AutonoMouse with a reinforcement sensor (this can be seen as a kind of trainer-to-agent communication). The use of a reinforcement sensor determines, as a side effect, the generation of error recovery rules, which are activated only when the system makes an error.
• For both flexible and rigid reinforcement programs, the use of agent-to-trainer communication through some observable behavior improves performance. Performance can be further improved by developing the module in charge of controlling the sequence in a simpler environment where the basic sequential structure of the task, that is, {αβ}*, is maintained, but where ambiguities in judging the AutonoMouse's behavior are removed.
Chapter 7

The Behavior Analysis and Training Methodology

This chapter is based on the article "Behavior Analysis and Training: A Methodology for Behavior Engineering," by M. Colombetti, M. Dorigo, and G. Borghi, which appeared in IEEE Transactions on Systems, Man, and Cybernetics-Part B 26 (3), © 1996 IEEE.
7.1 INTRODUCTION

As we suggested in chapter 1, we view our own work as a contribution to a unified discipline of agent development that we call "behavior engineering." Engineering of any kind is a human activity aimed at the development of artificial products and based on a technical apparatus including fundamental concepts, mathematical models, techniques for the application of such models, hardware and software tools, and methodologies. We think that behavior engineering should be no exception. Our main concern in this book has been to show that learning can have an important role in behavior engineering. As regards robot learning, today there is a wide repertoire of reasonably well understood concepts and models, and techniques for applying them to concrete problems. Moreover, several software tools are available to develop learning systems; in particular, the previous chapters have concentrated mainly on the learning classifier system (LCS) as a model and on ALECSYS as a tool. By contrast, there is no widespread concern about the methodological issues involved in the use of learning as a component of robot development. While the previous chapters have presented various pieces of the whole picture, in this chapter we propose a methodology that summarizes our experiences as designers and builders of learning autonomous robots. Section 7.2 presents a methodology for behavior engineering that we call BAT (Behavior Analysis and Training). Sections 7.3, 7.4, and 7.5 support the methodology with the description of three practical cases. The
first case, regarding AutonoMouse V, builds upon experiments already presented (however sparsely) in the previous chapters. The other two cases are examples of an application of the BAT methodology to commercially available robots.
7.2 THE BAT METHODOLOGY
The goal of any engineering methodology is to help engineers to develop high-quality products. This involves clarifying various aspects connected with the specification, design, implementation, and assessment of the product. The first important point to be settled is: how is the quality of the product going to be defined? Product quality is a complex function of many different components and cannot be measured by any simple scalar index. In general, some components of quality can be measured in quantitative terms, while others can only be qualitatively evaluated. In any case, a clear definition of all important aspects of quality may have an important impact on the development of the product. Before being designed, a product has to be clearly specified. Specifications are important for two reasons: first, they guide the design process; and second, they define a reference against which the final product will be assessed. Moreover, proper specifications are critical because errors in specifications are often discovered only in the later stages of development, when correcting them can be very costly. The core of the development process is design. At a fairly abstract level, a design process can be viewed as a sequence of decisions. Even though it is of course not possible to define an "algorithm" leading to perfect design, it is possible to point out what are the most critical decisions to be made, in what order they should be taken, and on what grounds. This is precisely what a designer wants from a methodology. Finally, the developer should know how to assess the quality of a product, that is, how to practically evaluate the different components of quality. In this section, we shall discuss all these facets from the point of view of agent development. In the following sections, we shall have to distinguish clearly between different components of the robot, and in particular between its "body" and its "mind." We shall describe a robot as made up of a "shell" (the inert part of its "body"), a "sensorimotor interface" (including sensors and effectors), and a "controller" (its "mind").
To keep our treatment close to common robotic practice, we assume that the problem faced by the engineer is to develop an autonomous robot in a situation where both robot shell and environment are essentially predefined. Of course, we are interested in the cases where the robot controller includes a learned component. In our experience, the main stages in the development of a robotic application are
1. To describe the robot shell and the initial environment, and to define the robot's target behavior, that is, the desired pattern of interaction between the robot and its environment;
2. To analyze the target behavior and decompose it into a structured collection of simple behaviors;
3. To completely specify the various components of the robot, in particular:
(i) the sensors and effectors interfacing the robot with its environment, possibly together with some artificial extension of the environment to make perception and action possible (for example, the "preys" we often dealt with actively produce a signal that allows the robot to locate them);
(ii) the controller architecture, that is, the overall structure of the robot's control program; and
(iii) a training strategy, that is, a systematic procedure for training the robot to perform the target behavior;
4. To design, implement, and verify the nascent robot, that is, the robot prior to training;
5. To carry out training until the target behavior is learned; and
6. To assess the robot's learning process and its final behavior with respect to the specified target behavior.
These stages can be arranged in a logical sequence (figure 7.1), in which each stage exploits at least part of the outputs of previous stages. It is important to note, however, that the temporal sequence in which activities are carried out will in general not be a simple traversal of the logical sequence from the beginning to the end. Some activities may start before the previous one has been terminated; often, inadequacies will be detected that will force the developer to cycle back to a previous stage; and sometimes it will even be necessary to go through the whole logical sequence more than once, thus organizing the development process according to the spiral model (see, for example, Ghezzi, Jazayeri, and Mandrioli 1991). In this chapter we shall be concerned only with the logical relationships among stages.
[Figure 7.1: Logical sequence of stages in the BAT methodology. Application Description and Behavior Requirements; Behavior Analysis; Specification; Design, Implementation, and Verification of the Nascent Robot; Training; Behavior Assessment.]
147
BehaviorAnalysisandTrainingMethodology a formal , quantitative component. It is up to the subsequentbehavior assessmentstageto compare data from robot testing with such specification (seesection 7.2.6) .
7.2.2 Behanor Analysis This stagetakes as inputs the outputs of the previous stage, and produces as output a description of the structured behavior. The analysis of behavior and the definition of its structure are key aspectsof our method. The ology target behavior is first decomposedinto simpler component behaviors, which can often be be further decomposedinto simpler ones, and so on, producing a hierarchical structure. The product of this decompos is what we call " structured behavior." Again , it is desirable that the component behaviorsbe specifiedin a rigorous, quantitative way so that such specification can be compared with data coming from behavior assessment (seesection 7.2.6) . There is no simple rule about how to carve simple behaviors out of " more complex.ones. For example, although " exploring the environment can only be establishedas a single basic behavior on the basis of general knowledge about robot engineeringand past designexperience, once the basic behaviors have been singled out, their interactions can and should be completely defined. What types of interactions among behaviors should one take into account? Again , there is no general answer becausethis issue is strictly connectedto the kind of controller architecture one is going to design. In section3.1, we singledout the following types of interaction: independent . As we shall see, a clear sum, combination, suppression , and sequence description of the interactions between simple behaviors is essentialfor the specificationstage.
7.2.3 Specification Thisstagetakesasinput the outputof thepreviousstage , andproducesas of outputs the specifications . . . .
• the robot's sensors and effectors;
• the extended environment;
• the controller architecture; and
• the training strategy.
As we have already said, both the robot shell and the initial environment are assumed to be given. In general, however, the sensorimotor
interface of the robot has to be at least partially redesigned for each specific behavior. For example, we may find out that objects in the environment can be roughly located through a sonar belt, and identified through a bar code reader. Analogously, requirements on motor control lead to decisions on the kind of output that the robot controller will have to send to the effectors. For example, suppose we need to control a platform moving on two tracks, connected to two independent motors. Assuming that the controller sends a message to each motor at every control cycle, we may decide that the controller's output can be one of the following: "Move forward"; "Do not move"; "Move backward." Given the state of the art in robotics, robots are often unable to deal directly with the initial environment. For example, it may be necessary to modify the objects with which the robot has to interact so that they can be detected, identified, and located. It is part of the specification stage to find out what must be added to the environment to make the target behavior possible. The environment thus equipped will be called the "extended environment." Another task of the specification stage is to establish the controller architecture. In our work, we have experimented with various kinds of architecture, implemented via ALECSYS. In section 3.3, we contended that the architecture of the controller should parallel the organization of the structured behavior, as established by behavior analysis. This result was obtained by allocating a behavioral module to each simple behavior. Architectures are classified as discussed in section 3.3. Table 7.1 summarizes the best correspondences between architecture types and mechanisms for building complex behaviors. Examples of the use of such architectures can be found in the previous chapters.
Table 7.1
Correspondences between architecture types and mechanisms for building complex behaviors (independent sum, combination, suppression, sequence)

Architecture types: monolithic; flat with independent outputs; flat with integrated outputs; hierarchical.
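As a rough illustration of the kind of coordination the hierarchical entry of table 7.1 refers to, the sketch below (ours; module interfaces are assumptions) shows a two-level controller in which a coordinator selects, at every cycle, which basic behavioral module drives the effectors.

```python
# Sketch of a two-level (hierarchical/switch) controller: two basic
# behavior modules plus a coordinator that learns which one to give
# priority to at each control cycle.
class Controller:
    def __init__(self, module_a, module_b, coordinator):
        self.module_a = module_a        # basic behavior module A
        self.module_b = module_b        # basic behavior module B
        self.coordinator = coordinator  # learned coordination module

    def step(self, percept):
        action_a = self.module_a.propose(percept)
        action_b = self.module_b.propose(percept)
        # The coordinator decides which proposal reaches the effectors,
        # implementing suppression of one behavior by the other.
        if self.coordinator.choose(percept) == "a":
            return action_a
        return action_b
```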
Finally, there is one more goal for the specification stage. As we have already suggested, the BAT methodology assumes that behaviors are allocated to behavioral modules by design, but the function of each module is developed by machine learning techniques, and we have assumed that the learning system is chosen in advance. However, one should not believe that having a learning system at hand solves all the problems. Even if the agent is able to learn, one still has to teach the right thing in the right way; in other words, one has to establish a training strategy. Once more, there is no simple, general rule that can always lead to the right decisions. Moreover, even the type of decisions to be made will depend on the kind of learning system adopted. Developers who rely on some kind of reinforcement learning have at least to establish
• the reinforcement program, that is, the information that will be provided to the learning system to make the robot converge to the target behavior; and
• the shaping policy, that is, whether the structured behavior should be learned in one shot or by a sequence of learning sessions, and if the latter, in which order the various behaviors and their interactions will have to be learned.
Such decisions are critical because training is strongly responsible for the final performance of the robot.

7.2.4 Design, Implementation, and Verification of the Nascent Robot

The documents produced by the previous stages provide sufficient information to design, implement, and verify the nascent robot, that is, the robot endowed with all its hardware and software components before it is trained. We do not suggest any specific procedure to be followed regarding this stage: designers should exploit known methodologies for hardware and software engineering. The output of the stage is a physical agent that has to be trained to perform the target behavior.

7.2.5 Training

Training acts on the nascent robot according to the established training strategy. Its output will be the complete robot. The main problem of training is that it can be an extremely expensive task: even the most efficient learning systems converge slowly, and therefore training may require numerous lengthy sessions. In many cases of practical interest it is possible to speed up training through the use of
simulators, with only a fraction of the effort needed to train real robots. Another advantage of simulation is that it allows one to carry out training in environments that are more manageable than the real one in some important respect; we have exploited this aspect in the case presented in section 7.4. In general, however, simulation is not sufficient, because much of the richness and unpredictability of the real world is lost. A reasonable compromise involves using a simulator to develop a first approximation of the final controller, which can then be refined through direct training of the real robot.

7.2.6 Behavior Assessment

Before introducing an assessment procedure, we need to clarify our notion of the quality of behavior. Because, broadly speaking, a robot is a hardware/software system, in principle we might simply borrow the quality components identified by hardware and software engineering: robustness, reliability, usability, portability, and so on. However, it seems to us that robots are special enough to deserve an independent treatment. For example, in classical software engineering one would sharply distinguish between correctness, robustness, and performance of a program. In particular:
and Training Behavior Analysis Methodology
. The robustnessof behavior has not so much to do with input data lying outside a predefinedrange as it does with the variability of the environment : intuitively , an agentis robust if it is able to carry out its task evenin unexpectedconditions. . Behavioral performancehas to do not only with required resourcesbut also with the degreeof accomplishmentof the agent' s task (as remarked above) .
Other interesting indices of behavioral quality areflexibility (defined as the richnessof the agent' s behavioral repertoire) and adaptiveness (defined as the ability to adapt in real time to changesin the structure or in the 2 dynamicsof the environment) . Thus far , our approach has beenthe following . First , we define performance in such a way that a high level of performanceimplies the execution of the target behavior. Second, we definean RP that is monotonically related to performance (in such a way that higher rewards imply higher performance) . Then we usereinforcementlearning to developa controller that maximizesreward, and thus performance. It is interesting to note that even if we have not explicitly pursued components of quality different from performance, our agents show a , thanks to the properties of good level of robustnessand adaptiveness LCSs and of reinforcement learning (seesection 5.2.4) . Indeed, the performanc of LCSs tends to degrade fairly gracefully, and learning is a powerful way of coping with structural changesin the agent- environment interaction. In principle, however, there is no reasonwhy the direct concern about quality should be confined to performance. It is feasible to include in the RP an evaluation not only of performancebut also of other quality components, although more work is neededto seewhether this approach is practically viable. To evaluate performance, we have used two types of performance indexes: 1. Local performanceindexes(or learning indexes) measurethe effectiveness of the learning process(that is, the correspondencebetweenwhat is taught and what is learned) . In general, local performanceis defined as
P(t)=~t ' wheret is the number of robot movesfrom the beginningof the experiment and R(t ) is the number of movesthat have beenpositively reinforced from
152
the beginning of the experiment. This is the same index, first defined in section 4.1.1, that we used to evaluate the results of the simulation experiments presented in the previous chapters. Note that local performance indexes allow one to evaluate the effectiveness of all kinds of behavior, either basic or coordination ones.
2. Global performance indexes measure the correspondence of the robot's behaviors with the target behavior (as defined by the corresponding requirements). Typically, global performance at t is computed as
G(t) = t / A(t),
where A(t) is the number of achievements from the beginning of the experiment (i.e., the number of times the agent has reached the goal). This measure is used when A(t) is not fixed in advance, as in the AutonoMouse V experiments reported in section 7.3. When the number of achievements is predefined, as in the HAMSTER experiments in section 7.4, G is computed as the total number of moves necessary to obtain them. Note that, contrary to the previous index P(t), a low value of G(t) denotes a high performance, and vice versa. Given that we use immediate reinforcements to train the robot to perform its global task, local indexes are suitable for the assessment of the learning process, and global indexes are suitable for the assessment of the robot's global behavior. Note that a robot might reach a high local performance without even starting to perform the required task. Consider a robot whose target behavior is to reach a moving light. Typically, we would train the robot to move in the direction of the light, with the underlying belief that the robot will eventually reach its goal. However, if the light runs too fast, the robot will never reach it (low global performance), even if it moves in the right direction (high local performance). Actually, the two indexes are related to different facts: local performance has only to do with learning what one is taught to do, while global performance has also to do with being taught the right thing. It is part of the specification of the training strategy to make sure that, given the particular environmental conditions the robot is going to face, a high local performance will imply a high global performance. In any case, it is up to the assessment stage to prove that this is the case. To conclude this section, figure 7.2 summarizes all the stages previously described. The two large gray arrows are there to remind us that assessment essentially involves a comparison between experimental data and a specification.
[Figure 7.2: Global scheme of the BAT methodology. Application Description and Behavior Requirements (description of robot shell, description of initial environment, requirements on target behavior); Behavior Analysis (structured behavior); Specification (sensors and actuators, extended environment, controller architecture, training strategy); Design, Implementation, and Verification of the Nascent Robot; Training; Behavior Assessment (data on global behavior, data on local behavior).]
Ji1a - e 7.2 of BATmethodology Globalscheme ment essentiallyinvolves a comparison betweenexperimentaldata and a , data on global behavior will specification. As we have already suggested be compared with the requirements on the target behavior, and local performancedata with the specificationof the structured behavior.
7.3 CASE 1: AUTONOMOUSE V

This section instantiates the BAT methodology using a simple robot, AutonoMouse V, as a running example. The application chosen is very similar to the Find Hidden task briefly discussed in section 5.3. Here we give a much more detailed description of the whole experiment, which was run using AutonoMouse V, a version of AutonoMouse IV enriched with a rotation sensor.
[Figure 7.3: Functional schematic of AutonoMouse V, showing the left and right visual cones, the sonar beam, the rotation sensor, and the bumpers.]

[Figure 7.4: AutonoMouse V light sensors.]
7.3.1 Application Description and Behavior Requirements

AutonoMouse V is a small mobile robot, 35 cm long (plus 26 cm for the tail), 15 cm wide, and 28 cm high. Its sensory apparatus comprises two light sensors, one sonar, three bumpers, and a rotation sensor able to detect and measure a change of direction (see figure 7.3). It is also provided with two motors that control two tracks. The robot carries a battery that provides it with three hours of autonomy. The light sensors are two photodiodes, positioned within a structure that makes them partially directional devices. The two eyes together cover a 270-degree zone, with an overlapping of 90 degrees in front of the robot (see figure 7.4). They can distinguish 256 levels of light intensity. The sonar is highly directional (it detects obstacles in front of the robot) and can sense an object as far away as meters. It outputs an eight-bit number coding its estimate of the distance to the obstacle.
[Figure 7.5: AutonoMouse V bumpers (left, front, and right).]
Find Hidden = Chase " Search That is, Chase and Search are the two basic behaviors constituting the structured behavior, and the interaction among them . is suppression.
156
7 Chapter
~ sonar beam .
.
.
.
.
.
. ~ 1IIIII1I . . . wall
.
figure7.6 Experimental environment for Find Hidden behavior
Chase is active when the light is seen. Search is a rather complex behavior that includes turning around to look for the light or the wall , ' approaching the wall when it is sensedby the sonar, and trailing around the wall when it is sensedby the bumpers.
7.3.3 Speciftcadon The sensorsdefined in section 7.3.1 are too fine-grained for our application . When using a learning algorithm, it should be rememberedthat the greater the amount of sensoryinformation , the greater the dimension of the searchspace. It is therefore advisableto choosethe coarsestsensorial granularity that still allows the learning algorithm to distinguish among different environmental states from the task point of view. This was experimentallydemonstratedin section 5.3.4. We choseto put a threshold on both light sensorsand the sonar sensor, transforming them into on/ off devices. The output of the light sensorsis therefore either " I seea light " or " I do not seea light," while the output of the sonar is either " I sensean object" or " I do not sensean object" (the threshold for the sonar was chosensuchthat objectsare sensedif they are closer than 1.5 metersaway) . The direction sensor, that is, the tail , was thresholdedso that Autono Mouse V knows that it has changedits direction wheneverit turns by an . In order not to burden the learning system angle greaterthan 1.2 degrees with uselessextra search spacegiven by too detailed control on movements , motor moves have been thresholdedso that tracks are moved by
157
Behavior Analysis and Training Methodology
pleated wall
-
-
. . . .
sonar
beam . . . . . . . . . . . . .
A
7.7 Ji1gare Pleatedwall sending two-bit messagesthat correspond to the following possible movementsof each track: " Move backward at speedone" , " Stay still " , " Move forward at " " " speedone , an~ Move forward at speedtwo. Speed two correspondsto the maximum velocity of the robot , and speedone, to one half of maximum velocity. The initial environment was extended by a SOW lithgbulb moved . around by one of the experimenters, and by a pleated wall obstacle (see figure 7.7) . A pleated wall was chosen becauseour single and highly directional sonar lacked the capacity to perceivesmooth surfacesunder all the possibleanglesof incidenceof the sonar signal. Given the kind of structured behavior describedin section 7.3.2, the switch architectureis the most appropriate and was used for all Autono Mouse V experiments(this architecturewas comparedwith other possible solutions both in real-world experimentsand in simulation; see Dorigo and Maesani 1993). In this architecture, the light approaching and the light searchingbehaviors ( basic behaviors) are coordinated by a switch module that learns to choosewhich basic behavior to give priority to. As regards the training strategy, there are two decisions to be made: the reinforcementprogram and the shapingpolicy .
7.3.3.1 Reinforcement Program

The reinforcement program is in charge of giving reinforcements after each move of AutonoMouse V. It is usually composed of a different subprogram for each basic behavior. For AutonoMouse V, we defined the following reinforcement programs for each of the two behaviors identified in section 7.3.2:
1. Light-approaching reinforcement program. After each move, the robot is rewarded if its distance from the light decreases; otherwise, it is
punished. Changes in distance are evaluated by measuring the change in intensity of the light as seen by the two eyes. AutonoMouse V's eyes therefore have a double use: as agent sensors (in which case they are thresholded), and as trainer sensors (in which case their full granularity is used to evaluate changes in light intensity).
2. Light-searching reinforcement program. Light searching is not as easy to define as light approaching, in that it allows the definition of a few different searching strategies. The main goal of this behavior is to search behind obstacles to see whether the light is there. For example, the agent could first use the sonar to locate the edge of the obstacle, and then move directly in that direction. Another possible strategy could be to use the sonar to locate an obstacle, then approach it, and finally turn around it, using the bumpers to maintain contact with the obstacle (this could be done with a reinforcement program that rewards a right turn when the left bumper is on, and a left turn when it is off; in this way AutonoMouse V would learn to move along the obstacle with a zigzag kind of motion). These and a few other strategies were investigated by Dorigo and Maesani (1993). In the experiments presented here, we used the first strategy, which works as follows: if the sonar senses an obstacle, the robot is rewarded if it moves forward and at the same time turns in the same direction as the previous move. As soon as the sonar no longer senses the obstacle, the robot is rewarded if, while still moving forward, it turns in the opposite direction ("move forward while turning" moves are obtained by setting one motor to speed two and the other to speed one). This reinforcement policy generates an obstacle-approaching trajectory like the one presented in figure 7.8.
Figure 7.8 Light-searching behavior.
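Purely as an illustration, the two reinforcement subprograms could be sketched as follows. The reward magnitudes, helper names, and turn encoding are assumptions; the trainer's full-granularity light reading stands in for the distance estimate described above.

```python
# Sketch of the two reinforcement subprograms (RPs) for AutonoMouse V.
# Reward values, names, and the turn encoding (+1 right, -1 left) are illustrative.

def light_approaching_rp(prev_light_intensity, curr_light_intensity):
    """Reward a move if the trainer's full-granularity light reading increased,
    i.e., the estimated distance from the light decreased; punish it otherwise."""
    return +1 if curr_light_intensity > prev_light_intensity else -1

def light_searching_rp(sonar_senses_obstacle, moving_forward, turn_dir, prev_turn_dir):
    """First strategy from the text: while the obstacle is sensed, reward forward
    moves turning in the same direction as the previous move; once the obstacle is
    no longer sensed, reward forward moves turning in the opposite direction."""
    if not moving_forward:
        return -1
    if sonar_senses_obstacle:
        return +1 if turn_dir == prev_turn_dir else -1
    return +1 if turn_dir == -prev_turn_dir else -1
```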
7.3.3.2 Shaping Policy
The shaping policy used was modular. First we trained the two basic behaviors, then froze them (that is, their learning algorithms were switched off) and let the coordination behavior learn.

7.3.4 Design, Implementation, and Verification
This step exploits standard hardware and software engineering methodologies. All the design and implementation details regarding this stage of the BAT methodology are documented in Dorigo and Maesani 1993.
7.3.5 Training
AutonoMouse V was trained using the training strategy defined in section 7.3.3. The results obtained using the switch architecture with modular shaping were compared with those obtained by first training the two basic behaviors in a simulated environment and then copying the two rule sets into the two basic behavioral modules used to control the real robot (we call this the "switch architecture with initial knowledge"). After transferring the two rule sets, training continues using the same training and shaping strategies as in the previous experiment.
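The modular shaping policy of section 7.3.3.2 and the "initial knowledge" variant just described can be summarized in a short sketch. The module interface (load_rules, train_step, freeze) and the reinforcement-program handles are hypothetical and only serve to make the procedure explicit.

```python
# Sketch of modular shaping for the switch architecture (hypothetical interface).

def modular_shaping(chase, search, coordinator, env, rp, cycles, initial_rules=None):
    if initial_rules is not None:
        # "Switch architecture with initial knowledge": seed the basic behaviors
        # with rule sets learned beforehand in a simulated environment.
        chase.load_rules(initial_rules["chase"])
        search.load_rules(initial_rules["search"])

    # 1. Train each basic behavior separately, with its own reinforcement subprogram.
    for module, subprogram in ((chase, rp.light_approaching), (search, rp.light_searching)):
        for _ in range(cycles):
            module.train_step(env, subprogram)
        module.freeze()   # switch the module's learning algorithm off

    # 2. Train the switch (coordinator) to choose which frozen behavior gets priority.
    for _ in range(cycles):
        coordinator.train_step(env, rp.coordination)
```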
7.3.6 Behavior Assessment
This stage of the BAT methodology is devoted to assessing the degree of learning achieved by the learning agent. Two tools are useful for this assessment: the local and the global performance indexes. Figures 7.9 and 7.10 report the behavior of P(t) and δ(t), as defined in section 7.2.6, for the switch architecture (SA) and the switch architecture with initial knowledge (SAIK), respectively. Graphs are averaged over five runs. Both indexes show that, given the same number of learning cycles, starting with some initial knowledge developed in simulation helps. Neither architecture achieves very high local performance. This is mainly due to the limited learning time: we stopped the experiment after 6,000 AutonoMouse V moves because of time constraints (6,000 moves take about one hour). A longer experiment showed that after 12,000 AutonoMouse V moves the system was still learning. In this experiment, the assessment of global behavior is somewhat arbitrary, given the highly research-oriented nature of the task. δ(t) decreases as P(t) increases, an indication that the training strategy is somewhat successful in driving the learning system toward better global performance. Whether the given training strategy is good can be evaluated
only by comparing it with other training procedures, such as the one proposed in section 7.3.3. In a more application-oriented task, δ(t) would be compared against quantitative requirements on the target behavior.

Figure 7.9 Local performance index: comparison between the SA and the SAIK architectures (with the local index, a higher value reflects a better performance). Average over 5 runs.
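The two assessment tools can be sketched as follows. Their exact definitions are given in section 7.2.6 and are not reproduced here; the sketch assumes that the local index is a moving average of the fraction of rewarded moves (higher is better, as in figure 7.9) and that the global index is the number of moves needed to accomplish the task (lower is better, as in figure 7.10).

```python
# Sketch of the two assessment tools (assumed definitions; see section 7.2.6).

from collections import deque

class LocalPerformanceIndex:
    """Moving average of the fraction of moves that received a positive reinforcement."""
    def __init__(self, window=100):
        self.window = deque(maxlen=window)

    def record(self, reinforcement):
        self.window.append(1 if reinforcement > 0 else 0)

    def value(self):
        return sum(self.window) / len(self.window) if self.window else 0.0

def global_performance_index(moves_per_trial):
    """Average number of moves needed to accomplish the task over a set of trials."""
    return sum(moves_per_trial) / len(moves_per_trial)
```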
7.4 CASE 2: HAMSTER
A second example of the application of the BAT methodology is HAMSTER, a mobile robot based on a commercial platform, whose task is to bring "food" to its "nest."3 We shall confine ourselves to describing the main differences with respect to AutonoMouse V. The chief features of HAMSTER are that it combines "innate" (i.e., prewired) and learned behaviors, and that its training was carried out in a simulated environment and then transferred to the physical robot.
7.4.1 HAMSTER's Shell
HAMSTER's shell is Robuter, a commercial platform produced by RoboSoft (see figure 1.5 for a picture). It is 102 cm long, 68 cm wide, and 44 cm high.
Figure 7.10 Global performance index: comparison between the SA and SAIK architectures (with the global index, a lower value reflects a better performance); horizontal axis: number of cycles (thousands). Average over 5 runs.

The configuration we used has a belt of twenty-four Polaroid sonars surrounding the whole platform. Motion is produced by two motors acting on two independent wheels.
7.4.2 The Hoarding Behavior
The target behavior is food hoarding, that is, collecting food pieces and storing them in HAMSTER's nest. The environment is a room of size 14 m x 13.3 m, with various obstacles (see figure 7.11). Each piece of food is a cylinder (diameter 30 cm, height 70 cm) that slides on the floor when pushed by HAMSTER. The nest is located in a corner of the room. The target behavior can be decomposed as follows:

Hoard Food = (Leave Nest . Get Food . Reach Nest)* + Avoid Obstacles
7.4.3 Specification
Two rigid metal bars have been adapted to the Robuter's front so that food cylinders do not slip to either side when they are pushed. A frontal proximity sensor, based on the frontal sonars, allows HAMSTER to sense whether an object in front is far away, fairly close, very close, or at contact.
Figure 7.11 HAMSTER's environment (food pieces and HAMSTER).

The food cylinders are wrapped in violet paper, and the nest's position is marked by another cylinder (diameter 30 cm, height 130 cm) wrapped in pink paper. HAMSTER uses a frontal color camera to identify the positions of the food cylinders and of the nest, which are distinguished on the basis of color. Moreover, the nest sensor exploits an odometer (that is, a sensor that estimates the robot's position and heading) to identify the approximate position of the nest when the nest is not visible. A main concern of the HAMSTER project was to combine learned and innate behavior modules. We chose to program Avoid Obstacles directly, implementing a potential-based avoidance mechanism that exploits Robuter's sonars (see, for example, Latombe 1991). The remaining part of the hoarding behavior, that is, the food-fetching cycle, can be analyzed as a pseudosequence (see section 6.2). This means that at each control cycle there is enough information coming from the sensors to decide which of the three subbehaviors should be active:
1. Leave Nest: when the robot is in the nest, leave any previously captured food in the nest, and then leave the nest.
2. Get Food: when the robot is out of the nest and no food is captured, get a piece of food.
3. Reach Nest: when the robot is out of the nest and a piece of food is captured, reach the nest, pushing the piece of food.

Figure 7.12 Controller architecture for HAMSTER: a coordinator module at level 2 above the Leave Nest, Get Food, Reach Nest, and Avoid Obstacles modules at level 1, all interacting with the environment.

We adopted a hierarchical controller architecture, with four behavioral modules at level 1 and a coordinator module at level 2. Each level-1 module proposes a direction for the robot's movement. The coordinator chooses one of the moves proposed by Leave Nest, Get Food, and Reach Nest on the basis of the current situation; this move is then combined with the move proposed by Avoid Obstacles (see figure 7.12). In general, this combination amounts to a weighted vector sum of the moves proposed by Avoid Obstacles and by the rest of the system. There is, however, a more complex case. When HAMSTER is nearing a piece of food, the front part of the Avoid Obstacles behavior has to be inhibited; otherwise, the food could never be captured. The Get Food and Reach Nest behaviors learn to inhibit the front sonars as needed. A sketch of this combination is given below.
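The following sketch makes the coordination scheme explicit: the active food-fetching subbehavior is selected from the sensed situation (the pseudosequence of section 7.4.3), and its proposed move is combined with the Avoid Obstacles move as a weighted vector sum, with the frontal sonars optionally inhibited. The weights, sonar indices, and data layout are illustrative assumptions, not the actual learned controller.

```python
# Sketch of HAMSTER's level-2 coordination (illustrative names and values).

FRONT_SONARS = {0, 1, 2, 21, 22, 23}   # hypothetical indices of the frontal sonars

def select_task(in_nest, food_captured):
    """Pseudosequence: the sensors alone determine which subbehavior is active."""
    if in_nest:
        return "leave_nest"
    return "reach_nest" if food_captured else "get_food"

def avoidance_move(sonar_repulsions, inhibit_front):
    """Sum of per-sonar repulsion vectors; the frontal sonars can be inhibited
    (their repulsion ignored) while a food cylinder is being pushed."""
    x = y = 0.0
    for sonar_id, (rx, ry) in sonar_repulsions.items():
        if inhibit_front and sonar_id in FRONT_SONARS:
            continue
        x, y = x + rx, y + ry
    return x, y

def coordinator_output(task_move, avoid_move, w_task=1.0, w_avoid=1.0):
    """Weighted vector sum of the selected task move and the avoidance move."""
    return (w_task * task_move[0] + w_avoid * avoid_move[0],
            w_task * task_move[1] + w_avoid * avoid_move[1])
```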
7.4.4 Training
HAMSTER was trained through a modular shaping policy; that is, each learned level-1 module was trained separately and then frozen. The coordinator was then trained to achieve the target behavior. We decided to use very simple reinforcement programs (RPs). For example, the Reach Nest behavior was rewarded if and only if the distance from the nest decreased. Training the robot while obstacle avoidance was active would have forced us to embed a model of obstacle
avoidance into the RP, which would otherwise be unable to evaluate HAMSTER's moves properly without knowing the effects of obstacles on the robot. But such an RP would have been very complex. To keep the RP simple and yet eliminate the interference of the innate Avoid Obstacles behavior, training therefore had to be carried out in an obstacle-free environment. Because we had no such physical environment available, all training was carried out in a simulated environment, and the resulting modules were then transferred to the real robot. The transfer of the learned controller from a simulated environment turned out to be an interesting option in its own right.

Figure 7.13 Three experimental situations. Rectangular shaded areas are obstacles; black circles are food pieces. HAMSTER is shown at its initial location, that is, in the nest (shaded circle).

7.4.5 Assessment
The degree of learning was evaluated through the local performance indexes of the basic behaviors, computed in the simulated environment with no obstacles. As expected, we obtained reasonably high values (from 0.75 to 0.85, depending on the specific behavior and on the different RPs we experimented with). We did not compute the local performance index in the environment with obstacles because it would have been meaningless; for example, during the Reach Nest behavior the robot would have been punished if it moved away from the nest in order to avoid an obstacle. We then transferred the controller to the real robot and ran some experiments in the real environment (with obstacles). In this environment, it is meaningful to evaluate the global performance, computed as the number of moves necessary to accomplish the task, averaged over a set of sample situations. More precisely, we chose three different initial situations (see figure 7.13), all including two pieces of food, and ran ten trials
Table 7.2 Global performance index of HAMSTER in the real environment

             Situation a    Situation b    Situation c
Mean         365.30         450.00         313.50
Std. dev.    27.77          127.30         34.33
for each of them, recording the total number of cycles necessary to complete the hoarding of both pieces of food. Finally, we computed the mean and the standard deviation of these data (table 7.2). In situation c, HAMSTER was unable to accomplish the task in five trials out of ten, owing to the difficulty of first getting the piece of food in front of the initial robot position and then avoiding the nearby obstacle. For this situation, the data reported relate to the five successful trials only.
7.4.6 Conclusions
The HAMSTER project shows that it is feasible to implement a robot's controller starting from both innate and learned behavioral modules. To keep the RPs as simple as possible, however, it is necessary to carry out training in environments where the innate behaviors are not active. In our case, this required that we carry out training in a simulated environment, and then transfer the learned controller to the physical robot for the final assessment. Whether the global performance of HAMSTER can be considered acceptable is difficult to say. In a real application, some minimum performance level would have been established in advance. In our case, we can only say that informal observation of the behavior gave us the impression that HAMSTER was performing reasonably well.
7.5 CASE 3: THE CRAB ROBOTIC ARM
For our final example of the BAT methodology, we used an industrial manipulator, namely an IBM 7547 with a SCARA geometry (see figure 1.6 for a picture). In this section, we shall refer to this robot as "CRAB" (classifier-based robotic arm), a name that indeed fits the shape of its arm. CRAB is a two-link robotic arm. The first link can rotate 200 degrees around the shoulder joint, and the second link can rotate 160 degrees, with respect to the first link, around the elbow joint. As a result, the end effector, attached to the wrist, can cover the gray and the dashed areas shown in figure 7.14. The effectors are two motors, acting on the shoulder
and the elbow joints, that rotate the first link (with respect to the fixed base) and the second link (with respect to the first link). The target behavior we taught to CRAB is a very simple one, namely, to have the end effector reach a still object ("food") placed in the gray area of figure 7.14 (the "foraging area"). In this case, the target behavior is itself a simple behavior, and therefore behavior analysis is trivial. The reason we were interested in such a simple behavior is that, given the polar geometry of CRAB, ALECSYS has to learn behavior rules that are very different from the ones controlling the mobile robots described in the previous sections. In particular, the relation between the actions (i.e., elementary rotations of the two links) and the resulting movement of the end effector involves complex trigonometric transformations. Therefore, it is not at all obvious that the results obtained with mobile robots can be made to bear on the present case. In CRAB's environment, there is exactly one piece of food in the foraging area at any moment. Each time CRAB reaches it, the piece of food is moved to a new random position within the foraging area. To be sensed by the robot, the piece of food emits infrared light that can be received by an appropriate sensor. Deciding where to place such a sensor presents a simple but interesting specification problem. A natural agent using arms to grasp objects would be endowed with two types of sensory capacities: proprioception, the capacity to know the relative position of the limbs; and vision, the capacity to identify the relative direction and distance of the object to be grasped.
Figure 7.15 Why the sector-based sensor must be placed on the end effector: (a) the position of object P cannot be distinguished from the position of the end effector; (b) object P can be approached with any degree of precision.
The problem with our artificial agents is that ALECSYS can only deal with a very limited amount of input information, largely insufficient to determine the relative distances and angles of objects with the required degree of precision. Therefore, for both proprioception and infrared sensing we adopted sector-based sensors, as we did with our mobile robots. However, because it cannot discriminate positions on the basis of distance, a sector-based sensor placed on the shoulder is structurally unable to locate an object precisely (see figure 7.15a). To guide the end effector exactly to a given point, the sector-based sensor must be placed on the end effector (figure 7.15b). Even though the agent will still not be able to discriminate on the basis of distance, the end effector can be guided to the goal position simply by moving in its direction. As regards proprioception, simulation experiments showed that it is enough to have information about the elbow angle, that is, the angle between the first and the second link (Dorigo, Patel, and Colombetti 1994; Patel and Dorigo 1994; Patel, Colombetti, and Dorigo 1995). The sensorimotor apparatus of CRAB was then specified as follows (a sketch of this encoding is given after the list):
• A proprioceptive sensor provides information on whether the elbow angle is between 0 and 80 degrees, or between 81 and 160 degrees.
• An infrared sensor, placed on the end effector, tells the robot in which of eight equal sectors the food is located (all sectors are of 45 degrees).
• Two motor effectors, one each for the elbow and shoulder joints, rotate right (4 degrees for both joints), rotate left (4 degrees for both joints), or stay still.
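A minimal sketch of this coarse encoding follows; the function names, the sector numbering, and the action tuples are illustrative assumptions, not the actual ALECSYS interface.

```python
# Sketch of CRAB's coarse sensorimotor encoding (illustrative names and layout).

def elbow_bit(elbow_angle_deg):
    """Proprioception: 0 if the elbow angle is in [0, 80] degrees, 1 if in [81, 160]."""
    return 0 if elbow_angle_deg <= 80 else 1

def food_sector(food_bearing_deg):
    """Infrared sensing: index (0-7) of the 45-degree sector, relative to the end
    effector, in which the food piece is located."""
    return int((food_bearing_deg % 360) // 45)

# Actions: each joint motor rotates left or right by 4 degrees, or stays still.
JOINT_ROTATIONS_DEG = (-4, 0, +4)
ACTIONS = [(shoulder, elbow) for shoulder in JOINT_ROTATIONS_DEG
                             for elbow in JOINT_ROTATIONS_DEG]
```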
For our controller, we chose a monolithic architecture, given that the target behavior is made up of a single, simple behavior. The infrared sensor we implemented allowed us to locate the food piece in the correct sector with good precision, but gave us a rather noisy estimate of the distance between the food piece and the end effector. Note, however, that information about this distance is not used by the learning agent, but only by the RP to compute reinforcements. Each move was reinforced in proportion to the variation of the distance between the end effector and the food piece, as estimated by the infrared sensor. To speed up learning in a typical experiment, we started from an initial rule base obtained from a training session of 25,000 cycles run in a simulated environment, and then ran a supplementary training session of 2,500 cycles with the real robot. At the end of this training session, we obtained a local performance index of 0.86, showing a limited increase in performance over the initial rule base obtained through simulated pretraining (a test run on the real robot with that initial rule base gave us a performance of 0.83). This result suggests that the supplementary training session on the real robot adapted the rule base learned in the simulated environment to the noisier physical sensor.
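The two-stage training schedule just described can be summarized in a short sketch; the session lengths come from the text, while the controller interface and the distance-proportional reinforcement function are hypothetical.

```python
# Sketch of CRAB's training schedule and reinforcement program (hypothetical interface).

def crab_rp(prev_estimated_distance, curr_estimated_distance, k=1.0):
    """Reinforcement proportional to the decrease of the infrared-estimated distance
    between the end effector and the food piece; k is an assumed scaling factor."""
    return k * (prev_estimated_distance - curr_estimated_distance)

def train_crab(controller, simulator, real_robot):
    for _ in range(25_000):          # pretraining in the simulated environment
        controller.train_step(simulator, crab_rp)
    for _ in range(2_500):           # supplementary session on the physical arm
        controller.train_step(real_robot, crab_rp)
    return controller
```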
7.6 POINTS TO REMEMBER
• Behavior analysis and training (BAT) is proposed as a first example of a behavior engineering methodology based on the analysis of behavior, the integration of machine learning techniques with other aspects of robot design, and the independent assessment of learning and of global behavior.
• At least in some cases it is possible to use behavioral modules learned in simulated environments as a starting point for real robot training.
• Learned behavioral modules can be integrated with hand-coded modules.
• The results obtained from the three practical cases show that the BAT methodology is sound, can be applied to robots of different types (such as mobile platforms and robotic arms), and can lead to realizations of practical interest.
Chapter 8
Final Thoughts
8.1 A RETROSPECTIVE OVERVIEW
In this book we have presented a new approach to the development of learning autonomous robots through behavior analysis, design, and training. We view our work as a first step toward behavior engineering, an integrated discipline for the development of robotic agents. In particular, we have advocated a methodology centered on robot shaping, that is, the instruction of agents based on reinforcement learning that exploits trainer-produced reinforcements. Our robots are strongly coupled with their environment through their sensorimotor apparatus, and are endowed with an "innate" (i.e., predesigned) controller architecture that has the ability to learn from experience. The learning activity of our agents can be viewed as a grounded (i.e., environment-sensitive) translation of an external specification of behavior coded into a reinforcement-producing program, the RP. Although our experimental activity exploits learning classifier systems as a learning paradigm, we believe our results can be extended to any learning approach used to solve reinforcement learning problems. Indeed, the idea of using a trainer to provide step-by-step reinforcement is more basic than the specific learning model adopted. For our experiments, we used both simulated and real agents. Simulated environments have proved very useful for testing design options, and the results of simulations turned out to be robust enough to carry over to the real world without major problems. The experimental activity carried out so far allows us to make several claims:
• Animat-like interactions in simple environments can be practically developed through step-by-step reinforcement learning. In general, we
consistently obtained reasonably high learning rates across multiple experimental trials.
• The kind of Animat-like behaviors we developed in simulation can also be developed with real robots, provided the sensory and motor capacities of the robots are similar to those of their simulated counterparts (see the experiments in chapters 5 and 7). When the use of a real robot is too time-consuming, training in a simulated environment and then transferring the resulting controller to a real robot can be a viable alternative.
• Fairly complex interactions can be developed even with simple, stimulus-response (S-R) agents. In particular, behavior patterns that appear to follow a sequential plan can be realized by an S-R agent when there is enough information in the environment to determine the right sequencing of actions (see, for example, the Chase/Feed/Escape experiment in section 4.4). However, the addition of non-S-R elements, like a memory of past perceptions, can improve the level of adaptation to the dynamics of the environment.
• The use of internal state (i.e., a state of the controller) allows our agents to learn simple sequential behavior patterns, a subtype of the larger class of dynamic (i.e., non-S-R) behaviors. Experiments are reported in chapter 6.
• To develop an agent, both explicit design and machine learning have an important role. In our approach, the main design choices involve (1) the sensors, effectors, and controller architecture of the agent; (2) the artificial objects in the environment; (3) the sensors and the logic of the trainer; and (4) the shaping policy. Learning is in charge of developing the functions implemented by the various modules of the agent controller. A comprehensive methodology is presented in chapter 7.
8.2 RELATED WORK
Our work situates itself in the recent line of research that concentrates on the realization of artificial agents strongly coupled with the physical world, usually called "embedded" or "situated" agents. Paradigmatic examples of this trend include Agre and Chapman 1987, Kaelbling 1987, Brooks 1990, 1991, Kaelbling and Rosenschein 1991, and Whitehead and Ballard 1991. While there are important differences among the various approaches, some common points seem to be well established. A first, fundamental requirement is that agents must be able to carry on their activity in the real world and in real time. Another important point is that adaptive
behavior cannot be considered as a product of an agent in isolation from the world, but can only emerge from a strong coupling of the agent and its environment. Our research has concentrated on a particular way to obtain such a coupling: the use of the LCS to dynamically develop a robot controller through interaction of the robot with the world. In this section we relate our work to other approaches to the same problem developed in adjacent research fields. Most prominent is the work on learning classifier systems (Holland and Reitman 1978; Wilson 1987; Robertson and Riolo 1988; Booker 1988; Booker, Goldberg, and Holland 1989; Dorigo 1995). As seen in chapter 2, building on that work, we introduced the idea of using a set of communicating classifier systems running in parallel on an MIMD architecture. We also modified the basic learning algorithms to make them more efficient. More generally, our work is related to the whole field of reinforcement learning. Reinforcement learning has recently been studied in many different algorithmic frameworks. Besides LCSs, the most common frameworks are temporal difference reinforcement learning and related algorithms, like the adaptive heuristic critic (Sutton 1984) and Q-learning (Watkins 1989; Watkins and Dayan 1992), often implemented by connectionist systems (e.g., Barto, Sutton, and Anderson 1983; Williams 1992), and evolutionary reinforcement learning (e.g., Moriarty and Miikkulainen 1996). These different approaches to reinforcement learning often overlap. For example, the adaptive heuristic critic and Q-learning have been implemented through a connectionist system by Lin (1992); Compiani et al. (1989) have shown the existence of tight structural relations between classifier systems and neural networks; Dorigo and Bersini (1994) have shown the strong connections between LCSs and Q-learning; and Kitano (1990), Whitley, Starkweather, and Bogart (1990), Belew, McInerney, and Schraudolph (1991), Nolfi and Parisi (1992), Schaffer, Whitley, and Eshelman (1992), and Maniezzo (1994) have shown that neural nets can be learned by means of an evolutionary algorithm. It happens very often that the reinforcement learning problems used to illustrate and compare proposed algorithms are taken from the realm of autonomous robotics, but only a few of these applications deal with real robots. For example, Singh (1992), Lin (1992), and Millán and Torras (1992) use a point robot moving in a two-dimensional simulated world. Grefenstette's SAMUEL, a learning system that uses genetic algorithms
(Grefenstette, Ramsey, and Schultz 1990), learns decision rules for a simulated task (a plane learns to avoid being hit by a missile). Booker's GOFER system (Booker 1988) deals with a simulated robot living in a two-dimensional environment, whose goal is to learn to find food and avoid poison. The choice of using a real robot makes things very different, because many aspects of the agent-environment interaction become truly unpredictable and the efficiency of the system becomes a major constraint. In the following sections, we will review applications of the above-mentioned classes of techniques to the learning of autonomous robot controllers. We will limit our attention to systems that, like ours, run on real robots.
8.2.1 Classifier System Reinforcement Learning
Donnart and Meyer (1996) have developed MonaLysa, an interesting learning architecture based on learning classifier systems, which they demonstrated in a real robot application. Although also based on a hierarchical control architecture, MonaLysa differs from ALECSYS in many respects. Its control system is composed of a greater variety of modules (reactive, planning, context manager, internal reinforcement, and autoanalysis). Whereas in ALECSYS tasks are switched on or off by a learned behavioral module acting as a coordinator, in MonaLysa tasks can be added or suppressed according to a criterion of internal satisfaction of the control system. MonaLysa has been demonstrated on a goal-finding-plus-obstacle-avoiding task; it is able to generate and store plans that help to avoid obstacles while reaching the goal, and to alter these plans in reaction to environment modifications. Also, MonaLysa uses a version of the LCS in which the classifiers completely cover the possible input-output mappings: no genetic algorithm is used to search for useful rules. A somewhat different approach is taken by Bonarini (1996), who uses ELF (a fuzzy version of the LCS) to train a mobile robot to chase another mobile robot. In this application, ELF exploits a real-time simulation of the interaction to implement a version of "anytime learning" (Grefenstette and Ramsey 1992).

8.2.2 Temporal Difference Reinforcement Learning and Related Algorithms
Temporal difference methods, widely studied in the last few years, have also been applied to the control of real robots. Kaelbling used reinforcement learning to let a robot called "Spanky" learn to approach a light source and avoid obstacles (Kaelbling 1990; but see also Kaelbling 1991).
She used a statistical technique to store an estimate of the expected reinforcement for each action-input pair, together with some information about how precise that estimate was. Unfortunately, only a qualitative description of the experiments run in the real world has been reported. Kaelbling's robot took from 2 to 10 minutes to learn a good strategy, while AutonoMouse II was already pointing toward the light after at most one minute. Still, the experimental environments were not the same, and Spanky's task was slightly more complex. One of the strengths of our approach is that it allows for an incremental building of new capabilities, something it is not clear Kaelbling's approach allows. Mahadevan and Connell (1992) have implemented a subsumption architecture (Brooks 1991) in which the behavioral modules learn by extended versions of the Q-learning algorithm. Besides using a different learning approach, their work also differs from ours in that their behavior coordination strategy is not learned but directly programmed. Millán (1996) has proposed TESEO, a connectionist controller based on the actor-critic architecture (Barto, Sutton, and Anderson 1983). TESEO's main goal is to overcome three common limitations of systems trying to solve RL problems: slow convergence, unsafe process, and lack of incremental improvement. It achieves these results in a simple navigation problem by using a set of basic reflexes (i.e., predesigned reactive behaviors) and by learning new reactions from the basic reflexes. It differs from ALECSYS in many respects, the most important being its use of prewired behaviors ("basic reflexes" in Millán's terminology) in a subsumption-like architecture. These basic reflexes allow the robot to start with an acceptable control policy, which is later refined or augmented by learning. In the HAMSTER experiment presented in section 7.4, ALECSYS was used to learn to coordinate prewired and learned basic behavioral modules.
8.2.3 Evolutionary Reinforcement Learning
In the last few years, behavior-based robotics has proved to be an excellent test bed for evolutionary computation methods. Beer and Gallagher (1992) have used genetic algorithms to let a neural net learn to coordinate the movements of their six-legged robot. Harvey, Husbands, and Cliff (1994) have evolved neural network controllers capable of producing a variety of visually guided behaviors. They also demonstrated the principle of incremental evolution, helped by selecting tasks of increasing complexity (i.e., a population developed to learn a given task is used as the
starting population to solve a slightly more difficult task). Floreano and Mondada (1996) have evolved a neural controller directly on the robot. They demonstrated the evolution of a collision-free navigation behavior and a homing behavior. All these approaches differ from ours in that they use a genetic algorithm to develop neural net controllers, instead of a set of behavioral rules. Koza and Rice (1992) used the genetic programming paradigm (Koza 1992) to evolve Lisp programs to control an autonomous robot. Although their experiments are run in simulated environments, they are still interesting because they try to reproduce the results obtained by Mahadevan and Connell (1992); unfortunately, their use of a simulated robot makes a fair comparison very difficult. Although Koza and Rice use genetic algorithms to develop a computer program, their approach is very different from ours: their genetic algorithm searches a space of Lisp programs, while ours is cast into the classifier system framework. Many studies of evolutionary reinforcement learning have taken the approach of developing the controller in simulation and then transferring it to the real robot. Examples are Nolfi et al. 1994, Grefenstette and Schultz 1994, and Jakobi, Husbands, and Harvey 1995. We also studied controller transfer from the simulated to the real robot (see section 7.4). Another interesting piece of work, Meeden 1996, compares connectionist reinforcement learning (based on backpropagation) and evolutionary reinforcement learning (based on genetic algorithms) as ways to develop a neural network-based controller for a small toy car called "Carbot." The interested reader can find a nice review of recent work on evolutionary reinforcement learning in Mataric and Cliff 1996.
8.2.4 Work on Shaping and Teaching in Reinforcement Learning
The idea of shaping a robot is related to the work by Shepanski and Macy (1987), who propose to train a neural net manually, by interaction with a human expert. In their work the human expert replaces our reinforcement program. Their approach, although very interesting, seems difficult to use in a nonsimulated environment; it is therefore not clear whether it can be adopted for real robot shaping. Clouse and Utgoff (1992) propose a method in which a teacher observes a reinforcement learner's performance and provides advice whenever the learner is judged not to be progressing well. Utgoff and Clouse (1991) propose a similar method, in which the teacher waits for the learner's request for help. Both approaches differ from ours in that the teacher says explicitly
which action should have been chosen, while our trainer only provides an immediate evaluation of the action chosen by the agent. Singh (1992) applies a shaping method to help a simulated robot learn composite tasks. In his approach, shaping means letting a robot learn a succession of tasks, where each succeeding task requires a subset of the previously learned tasks plus a new elemental task. He ran experiments using compositional Q-learning, an extension of Q-learning that allows the construction of the Q-values of a composite task from the Q-values of the elemental tasks in its decomposition. Singh's shaping differs from ours in that it focuses on the incremental nature of learning (the robot learns new tasks building on its previous experience): there is no explicit trainer, and the meaning of the modules that make up his architecture is not designed but learned on-line. In Lin's teaching method (1992), the teacher has to know how to generate a solution ("lesson" in Lin's terminology) to the problem confronting the learner. Because the design of such a teacher can be equivalent to being able to find the solution of the learning problem (or at least a good approximation of the solution), Lin's teaching approach can be considered as a way to transfer knowledge from an expert trainer to an ignorant agent. But because our trainer does not have to produce a solution, only to evaluate moves, it is easier to design than Lin's. Thrun and Mitchell (1993) have developed a method that allows an agent learning by reinforcement to use prior knowledge in the form of neural networks capable of predicting the results of actions. Gordon and Subramanian (1994) have proposed an evolutionary reinforcement learning method based on genetic algorithms. A teacher provides advice in the form of "if [conditions], then [achieve goal]." These rules are then made operational using the learning system's background knowledge about goal achievement. Maclin and Shavlik (1996) have proposed a similar method, whose main difference lies in the use of connectionist Q-learning instead of a genetic algorithm. All these works differ from ours in that their goal is to use domain knowledge and to integrate it into the knowledge representation system of the learning agent. Caironi and Dorigo (1994) have applied the same type of trainer (reinforcement program) proposed in this book to simulated robotic problems, using Q-learning as the learning algorithm. Caironi and Dorigo's work improves on previous work by Whitehead (1991a, 1991b), who also studied the problem of integrating delayed and immediate reinforcements in the Q-learning framework. Both these works differ from ours in that
their experiments were conducted in very simple simulated environments, where the robot had complete knowledge of the state. Nehmzow and McGonigle (1994) have used a trainer to provide immediate reinforcement to their robot: they provide positive or negative reinforcement by covering or uncovering the robot's upward light sensors. Their work is a first example of shaping by a human trainer. Finally, Asada et al. (1996) have developed a control system based on Q-learning for a robot that learns to shoot a ball into a goal. To reduce learning time, they use a method they call "learning from easy missions." The idea is to let the robot learn first in easy situations, and then in incrementally more difficult ones. Obviously, this requires some sort of external supervisor, capable of ranking situations by difficulty and of choosing tasks of increasing difficulty.
8.3 TRAINING VERSUS PROGRAMMING
The basic idea fostered in this book is that training an agent can be better than programming it. As with all approaches having some innovative content, one can take an optimistic or a pessimistic attitude. Let us start with some optimism. We have shown that a feasible way to build the controller of a robotic agent is to have it learn on the basis of direct interaction between the robot, its environment, and a trainer. We have approached the learning problem by evolutionary computation and reinforcement learning with the help of a trainer. As far as evolutionary computation is concerned, the idea is that such an approach will eventually allow robot developers to obtain higher-quality behavior than through direct coding (Mataric and Cliff 1996). The natural history of animal species shows that a process of fitness maximization can lead to highly complex and organized behavioral structures. In principle, there is no reason why an artificial process should not be able to mimic the natural development process, provided enough computational resources are available. Of course, we all know that the amount of resources exploited by natural evolution to produce complex organisms is enormous. Thus engineers are likely to need a deus ex machina to shortcut the natural process by many orders of magnitude: in our work, we have exploited the design of the controller architecture and of the sensorimotor interface. The amount of a priori design required by our approach is still extensive. This might be reduced by increasing the use of evolutionary
computation techniques; at least in principle, the structure of the sensorimotor interface and the architecture of the controller could undergo a process of artificial evolution (see, for example, Cliff, Harvey, and Husbands 1993). However, we have not yet explored this possibility.

Figure 8.1 Snakelike mobile robot.

Regarding the reinforcement learning aspect of our approach, the step-by-step reinforcement given by an RP might be regarded as a limitation, because the designer needs to write the RP, which could at first sight appear as difficult as directly coding the control program. It is important to note, however, that the RP need not be a control program in disguise. To see this, consider the problem of developing a light-following behavior for a number of different robots: wheeled or tracked vehicles (like the AutonoMice and HAMSTER), a robotic arm (like CRAB), and perhaps a snakelike robot like the one sketched in figure 8.1 (see, for example, Nilsson and Ojala 1995). The interesting aspect of RP-based training is that, at least in principle, it is possible to train all such robots with the same RP, that is, one that gives reinforcements proportional to the decrease of the distance between the robot's sensor and the goal. The resulting control programs will be sharply different, reflecting the different types of agent-environment interaction, which in turn depend on the different geometries of the agents. In this example, the effort required to invent the RP is virtually zero: that the distance between the robot's sensor and the goal should decrease is simply a formal statement of the specification of the light-following behavior. On the other hand, nobody has to worry about the more or less complex kinematics of the different robots: remarkably, this job is taken up by the learning process. A problem with the process sketched above, and in general with our training approach, is that it relies on much more information than
processes based on delayed reinforcements, in which a reinforcement is given only when a goal is reached. Moreover, proponents of delayed reinforcement learning (RL) claim, and we agree with them, that most
Unfortunately, the current state of understanding makes it unfeasible to use delayed RL to develop the control system of a real robot in a reasonable amount of time. We have therefore adopted an approach that puts the burden of delayed reinforcement on the shoulders of a trainer program, which embodies human knowledge of the task, understood as a pattern of agent-environment interaction. With respect to delayed reinforcement methods, there is a difference in the type and amount of input information that has to be considered in order to evaluate the agent's behavior: in general, a step-by-step RP requires more input information than a delayed RP. However, the goal of behavior engineering is not to save on input information but rather to exploit what input information is actually available, or can be made available at a reasonably low cost. A possible extra cost of step-by-step reinforcement might well be justified by the higher learning speed that is achieved. In conclusion, we believe that our approach based on training is sound, useful, and compatible with the present directions of robotic research, which is strongly concerned with the development of more complex types of behavior, so complex that they probably cannot be achieved through delayed RL. At present, however, there are some limitations that must be overcome before training can become a standard robot development technique.
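Before turning to the limitations, the light-following example discussed above can be made concrete with a hedged sketch: a single, morphology-independent RP that only looks at the distance between the robot's light sensor and the goal, whatever the platform. The names and scaling constant are illustrative assumptions.

```python
# Sketch of a morphology-independent RP for light following: the same reinforcement
# rule can be used to train a tracked vehicle, a robotic arm, or a snakelike robot,
# because the robot's kinematics never appear in it.

def light_following_rp(prev_sensor_to_goal_dist, curr_sensor_to_goal_dist, k=1.0):
    """Reinforcement proportional to the decrease of the distance between the
    robot's light sensor and the goal; k is an assumed scaling factor."""
    return k * (prev_sensor_to_goal_dist - curr_sensor_to_goal_dist)
```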
8.4 LIMITATIONS
A first limitation of our approach is that it restricts the application of learning to basic and coordination behaviors, and presupposes traditional design of other important aspects, like the sensorimotor interface and the architecture of the controller. This limitation is very common in current learning systems for nonsimulated autonomous robots (e.g., Mahadevan and Connell 1992). A further limitation is that our training technique relies on the availability of fairly sophisticated sensors. It is true that some sensory capacities are required only during training and are not needed for the performance of the learned behavior. For behavior to be adaptive,
however, it is desirable that learning never be completely switched off, and this implies that full sensory capacity (the agent's plus the trainer's) should be available for the whole life span of the agent. The availability of sophisticated sensors at a reasonably low cost is therefore likely to be a precondition for the success of robot shaping as a development technique. On the other hand, because the availability of powerful and low-cost sensors is also likely to become a precondition for the success of robotics in general, at least for applications in natural or almost natural environments, we expect that our requirements for the sensory apparatus of robots will be compatible with future developments in applied robotics. Another limitation has to do with the complexity of the behaviors that can be dealt with by the present version of ALECSYS. We feel that we have already exploited a system like ALECSYS almost to the maximum of its potential. Given that the size of the search space grows exponentially with the amount of input information, there is no hope that this problem can be solved simply by using faster and faster hardware. We therefore need to develop a new tool capable of dealing efficiently with larger amounts of information. At present, our Animat-like tasks make only soft demands on the sensory capacities of the robot: the sensors we use (or assume in simulations) can do little more than give binary information about the presence or absence of simple objects in certain regions of the environment. Realistic applications in complex environments will require the robot to recognize and locate many types of "passive" objects, that is, objects that are not specially engineered to be sensed by the robot. A further problem has to do with the development of dynamic behavior. Most of our experiments were aimed at developing stimulus-response controllers. On two occasions, we went beyond this limit, but clearly the problem deserves a much wider and more systematic study. In particular, it is not clear to what extent more complex behaviors will have to rely on internal models of agent-environment interaction; in other words, we do not know how "cognitive" a robot should be to perform tasks of practical interest in realistic environments.
8.5 THE FUTURE
When thinking about the future developments of a line of research, it is reasonable to distinguish between what can and should be done in the near future and the more visionary intuitions of the far-off lands to which our efforts might finally bring us.
8.5.1 The Near Future
All the limitations discussed in the previous section deserve some research effort. One interesting direction for future research would be to extend the application of learning techniques to aspects of robot behavior other than behavioral learning, in particular, perception and controller architecture. As far as perception is concerned, it would be desirable to use sensors that can physically adapt to the robot's environment. Given that research on evolvable hardware is still at an embryonic stage (but see Higuchi et al. 1996), we expect that in the near future sensor adaptation will have to be developed in simulated environments. The architecture of the controller, on the other hand, can, at least in principle, be dealt with satisfactorily at the level of software. This means that the structure of the behavioral modules might emerge from the learning process, instead of being predesigned. Another interesting extension to learning would be the integration of step-by-step training with delayed reinforcements (see Caironi and Dorigo 1994; Dorigo and Colombetti 1994b). While we believe that some form of step-by-step reinforcement will be necessary to speed up learning, at least in the initial stages, there is no reason why the actual achievement of final goals should not be a source of reinforcement for the agent. It would also be interesting to investigate whether, at least in some applications, we can get completely rid of the step-by-step RP. Recent results in reinforcement learning and training (Shepanski and Macy 1987; Clouse and Utgoff 1992; Nehmzow and McGonigle 1994) suggest that the design of the reinforcement program, which currently requires a substantial effort, could be replaced by direct interaction with a human trainer. In the future, this possibility could be compared with another interesting option, that is, the description of the target behavior through some kind of high-level, symbolic language. As a whole, we believe that our work shows the importance of learning for achieving a satisfactory level of adaptation between an artificial agent and its environment. Clearly, much further research is needed to understand whether our approach can scale up to a complexity comparable to the adaptive behavior of even simple living organisms and existing preprogrammed robots. In particular, we believe that it is time to face the problem of dealing with continuous input and output data, which would allow a much more satisfactory and reliable coupling between the agent and its environment. As regards the BAT methodology, we expect it to remain fairly stable in the future. This will allow us to design and implement a number of software
tools to help designers in robot development. Tasks that strongly need such tools are behavior analysis and specification, in particular, the specification of sensors. Finally, aspects of behavior engineering that deserve a deeper analysis are the concept of quality of behavior and the related issue of behavior assessment (see, for example, Smithers 1995). An appropriate set of behavior metrics will have to be developed if this area is to find its way into the field of industrial applications.
8.5.2 Beyond the Horizon
It seems to us that artificial agents will more and more exhibit a double nature: that of artificial systems (i.e., systems designed by humans), and that of quasi-biological agents (see, for example, Steels 1995). The very notions of environment, behavior, and the like obviously derive from biology; the same is true of many computational models widely adopted in the realization of artificial agents, like neural networks and evolutionary computation. Most important, the key concept for agent design seems to us to be that of adaptiveness, again borrowed from biology. Literally speaking, "adaptive" means "capable of adaptation." Contrary to the term "adaptable," which has both active and passive acceptations (meaning both "able to adapt" and "able to be adapted"), "adaptive" has an inherently active flavor: for example, an adaptive controller would be one that can tune its parameters on line to control a time-variant process. Biologists distinguish among different kinds of adaptation.1 The two kinds relevant for us here are evolutionary (or phylogenetic) adaptation, the process by which species adjust to environmental conditions through evolution, and ontogenetic adaptation, the process by which an individual adjusts to its environment during its lifetime. As far as behavior is concerned, ontogenetic adaptation is the result of learning. Let us first analyze evolutionary adaptation. Many researchers, including us, have shown that evolutionary computation is a suitable paradigm for developing the behavior of artificial agents. In practical applications, artificial evolution mingles with individual learning in that reproduction only occurs at the software level; evolutionary procedures can thus be viewed as a class of machine learning methods. What characterizes evolutionary methods is the presence of discovery operators, like mutation and crossover, and a process of selection based on an evaluation of the agents' behavior. Given the analogy with natural evolution, the result of such evaluation is usually called "fitness," a term that in biology denotes an
organism's effectiveness in spreading its own genetic makeup. Depending on the approach adopted, fitness can apply to different entities: to the whole behavior exhibited by the agent, to a single behavioral module, or to a highly specific behavior rule. In all cases, however, fitness will always be a function of the actual behavior produced by the agent in its environment. Given that an individual has high fitness if it produces successful offspring, it is almost tautological that the natural world is mainly inhabited by organisms with high fitness. The interesting problem for biologists is to find out which features of the individual essentially contribute to its overall fitness (or, in other words, which features have high adaptive value). In the realm of artificial agents, by contrast, the relationship between fitness and reproductive success is reversed: first, the fitness of an individual is computed as a function of its interaction with the environment; and second, the fittest individuals are caused to reproduce. This pattern is not exclusive to artificial systems; it is also applied by breeders of cattle, horses, dogs, and other animals to produce breeds with predefined features. Indeed, we believe that the best metaphor for evolutionary computation is not biological evolution, but breeding. This change of perspective is not without consequences. For example, it is typical of breeding that highly disadaptive features can be selected: the survival of purebred animals is often more problematic than that of mongrels. Analogously, applying evolutionary methods to the development of task-oriented behavior may easily result in a robot with a low survival capacity. In both animal breeding and evolutionary robotics, the fault is not in the evolution process; rather, it stems from the definition of a fitness criterion that is not sufficiently comprehensive. This suggests that, as far as possible, the fitness function used to develop robot behavior should take into account the overall quality of the agent, and not performance alone, thus leading to the intuition that quality should be regarded as the artificial counterpart of biological fitness. An important difference between the natural and the artificial is that, in animals, learning is sharply different from evolution. Moreover, the capacity for individual learning in organisms is itself a product of evolution; we must therefore expect to find such a capacity where it has adaptive value. In the realm of the artificial, on the other hand, there is no sharp conceptual separation between individual learning and evolution. In our experiments, for example, we have implemented individual learning of new behaviors in terms of the evolution of a population of behavior
rules: the ontogenetic adaptation of our agents is actually the result of a sort of phylogenetic adaptation taking place at a finer grain size. Machine learning techniques can achieve adaptation in two basically different ways: parameter setting and structural learning. In ALECSYS, for example, the bucket brigade algorithm is used to compute classifier strength (parameter setting), and the genetic algorithm is used to create new classifiers (structural learning). Many well-known learning algorithms, like backpropagation and simulated annealing, can only be used for parameter setting, whereas evolutionary computation methods, like genetic algorithms, are widely used both for parameter setting and for structural learning. There is no doubt that automatic parameter-setting methods are going to be useful, if not essential, for developing artificial agents, because agents will have to adapt to a world that designers can model only with some approximation. For this purpose, a variety of known learning methods can be used, including evolutionary strategies. The most stimulating applications of evolutionary strategies, however, are those intended to bring about new structures, such as behavior rules, neural network topologies, the articulation of a system into modules, and the like. We shall therefore direct our attention to such applications. From the point of view of structural design, the main interest of evolutionary computation is that it allows us to deal with a larger design space. If we call the conceptual space that is searched by human designers "rational design," we can say that evolutionary strategies search the space of "nonrational design."2 The relevant question is then, is there any practical justification for searching such a space? Or, in other words, what are the limitations of rational design that we would like to overcome by enlarging the spectrum of possibilities? Before we try to answer this question, we need to better characterize human design. We think that the main qualifying aspect of rationally designed systems is that they are modular. Modularity is what makes a complex system easy to conceive, understand, modify, and so on: divide and conquer is perhaps the most fundamental engineering principle. A system is modular when it is built from subsystems that are both highly cohesive and loosely coupled (see, for example, Ghezzi, Jazayeri, and Mandrioli 1991). High cohesion is an intramodular property: it means that the subsystem has a well-identified function and contains only components that operate to realize such a function. Loose coupling is an intermodular property: it means that the complexity of the interfaces
between subsystems is small with respect to the internal complexity of the modules. The virtues of modularity are manifold. Modular systems are easier to understand, maintain, and modify; in general, they are less prone to errors and easier to debug. It is important to note, however, that these properties are not directly related to the intrinsic performance of the system: in principle, a highly modular system may be less efficient than a less modular counterpart, although most nonmodular systems, because they have grown in a chaotic way, often behave rather poorly. The advantages of modularity concern the relationships between an artificial system and the human beings involved in its design, implementation, and maintenance. According to classical engineering methodologies, lack of modularity is a shortcoming. As regards nature, there is often sharp disagreement as to how modular biological systems are (see, for example, Fodor 1985). While nobody would claim that natural systems show no articulation into subsystems, such subsystems do not seem to be structured in a modular way (at least under the definition of module that we gave above): it is almost a truism that living organisms tend to act as a whole, which implies very strong coupling among subsystems and thus a low degree of modularity. Given our previous considerations, the lack of modularity of natural systems should come as no surprise. Indeed, observing natural systems, it seems reasonable to assume, at least as a working hypothesis, that their high degree of fitness might profit from their nonmodular organization. Clearly, nature optimizes fitness, not understandability by humans! That a high degree of modularity is not always a desirable (or possible) choice can be clearly seen by analyzing artificial intelligence programs. Much of artificial intelligence, and in particular knowledge engineering, could be viewed as the engineering of low-modularity software systems. Consider the programming paradigm of production systems: the interactions among production rules in any large production system are strong and, as any artificial intelligence programmer knows, often difficult to control. That the royal road of modularity is abandoned mainly by those who aim at building "intelligent" systems is indeed an interesting fact. Even if nonmodular systems can be very effective, to work properly they have to solve a number of hard problems, most notably:
• Redundancy of input information. While modular systems implement a rational preselection of useful input, nonmodular systems often have to deal with highly redundant input information.
As regards nature, there is often sharp disagreement as to how modular biological systems are (see, for example, Fodor 1985). While nobody would claim that natural systems show no articulation into subsystems, such subsystems do not seem to be structured in a modular way (at least under the definition of module that we gave above): it is almost a truism that living organisms tend to act as a whole, which implies very strong coupling among subsystems and thus a low degree of modularity. Given our previous considerations, the lack of modularity of natural systems should come as no surprise. Indeed, observing natural systems, it seems reasonable to assume, at least as a working hypothesis, that their high degree of fitness might profit from their nonmodular organization. Clearly, nature optimizes fitness, not understandability by humans!

That a high degree of modularity is not always a desirable (or possible) choice can be clearly seen by analyzing artificial intelligence programs. Much of artificial intelligence, and in particular knowledge engineering, could be viewed as the engineering of low-modularity software systems. Consider the programming paradigm of production systems: the interactions among production rules in any large production system are strong and, as any artificial intelligence programmer knows, often difficult to control. That the royal road of modularity is abandoned mainly by those who aim at building "intelligent" systems is indeed an interesting fact.

Even if nonmodular systems can be very effective, to work properly they have to solve a number of hard problems, most notably:

• Redundancy of input information. While modular systems implement a rational preselection of useful input, nonmodular systems often have to deal with highly redundant input information.

• Interference among subsystems. In nonmodular systems, the interaction among subsystems can be very complex. This may lead to useful emergent properties, but also to improper behavior.

• Difficulty of focusing action. Nonmodular systems may find it difficult to focus their action in a coherent way.

We should be able to identify specific solutions to these problems in natural agents. In our opinion, this is precisely the role of such important biological mechanisms as attention, inhibition, and motivation: attention deals with highly redundant input information; inhibition solves conflicts arising from interference among subsystems; and motivation focuses the system's activity toward specific goals.

In modular systems, these mechanisms are relatively marginal. Indeed, attention, inhibition, and motivation are not recognized as important architectural concepts in traditional software design. On the other hand, complex concurrent systems do make use of mechanisms, like interrupts, that are clearly related to inhibition; moreover, inhibition plays a fundamental role in the subsumption architecture, proposed by Brooks (1991) as a basis for behavior-based robotics.

Again, it is interesting to look at production systems that have a low degree of modularity. Production systems of the OPS5 type (Brownston et al. 1985) indeed deal with the problems we have pointed out. For example, the "recency principle" can be viewed as a mechanism to focus the interpreter's attention toward recently acquired information; inhibition among production instances is virtually implemented in the selection strategies applied to the conflict set; and context-based metaprogramming is analogous to a motivation mechanism. Examples of nonmodular artificial systems are the exception, however, not the rule.
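For readers unfamiliar with OPS5-style interpreters, the toy fragment below shows one way a recency principle can focus attention: several rule instantiations sit in the conflict set, and the one matching the most recently added working-memory element fires. This is a deliberately simplified reconstruction, not OPS5 itself, and the facts, rule names, and time tags are made up for the example.

```python
# Each working-memory element carries a time tag; higher means more recent.
working_memory = [
    {"fact": "light-visible", "time": 3},
    {"fact": "obstacle-near", "time": 7},
    {"fact": "battery-low", "time": 5},
]

# A rule instantiation pairs a rule name with the facts it matched.
conflict_set = [
    {"rule": "chase-light", "matched": ["light-visible"]},
    {"rule": "avoid-obstacle", "matched": ["obstacle-near"]},
    {"rule": "go-recharge", "matched": ["battery-low"]},
]

def recency(instantiation):
    # Recency principle: prefer instantiations built on the newest facts.
    times = {element["fact"]: element["time"] for element in working_memory}
    return max(times[fact] for fact in instantiation["matched"])

winner = max(conflict_set, key=recency)
print(winner["rule"])  # -> avoid-obstacle, the rule matching the newest fact
```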
In spite of the many interesting nonmodular ideas from artificial intelligence, the divide-and-conquer strategy has strong and valid rationality motivations. Is there any concrete need to depart from such a principle? Our opinion is that we cannot always be perfectly rational. Rationality presupposes knowledge: where we do not know enough, no rational methodology can lead us to the right decisions. And the development of an autonomous agent is a striking example of a case where we do not know enough: the real world is too complex and unpredictable to allow for exhaustive modeling. Thus far, we have experimented with very simple agents in fairly stable environments, which has allowed us to rely on rationality in the design of our robots. But the analysis carried out in this book leads us to expect that such an approach cannot smoothly scale up to really complex agents interacting with natural environments.

Of course, we also have to keep in mind that a methodology for behavior engineering must be technologically feasible. Evolutionary computation is highly time-consuming; given that physical robots move rather slowly, such computation should be carried out in simulated environments as far as possible. On the other hand, strong adaptation to a real environment can only be achieved in the environment itself, given that simulation environments are too idealized. This dilemma must be resolved in order to work out a feasible engineering methodology. One possible way out, and one that we have already started to follow (see the CRAB experiment in section 7.5), is to develop artificial agents in simulated environments as far as possible, and then to refine them through learning in the real world. How far this approach can be pursued, however, remains an open question.
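One possible reading of this way out, stated as a sketch rather than as a procedure the book prescribes, is a two-stage training loop: a long, cheap shaping phase in simulation followed by a shorter refinement phase on the physical robot. The controller, simulator, robot, and trainer objects, their method names, and the cycle counts are all assumptions made for the illustration.

```python
def shape_in_simulation(controller, simulator, trainer, cycles):
    # Stage 1: the bulk of the learning happens where trials are cheap and fast.
    for _ in range(cycles):
        sensors = simulator.read_sensors()
        action = controller.act(sensors)
        simulator.apply(action)
        controller.learn(trainer.reinforce(sensors, action))
    return controller

def refine_in_real_world(controller, robot, trainer, cycles):
    # Stage 2: the same learning loop, run on the physical robot so the
    # controller adapts to dynamics the simulation idealizes away.
    for _ in range(cycles):
        sensors = robot.read_sensors()
        action = controller.act(sensors)
        robot.apply(action)
        controller.learn(trainer.reinforce(sensors, action))
    return controller

# Typical usage: many simulated cycles, comparatively few real ones.
# controller = shape_in_simulation(controller, simulator, trainer, cycles=50000)
# controller = refine_in_real_world(controller, robot, trainer, cycles=2000)
```

The same learning loop is reused in both stages; only the source of sensations, and the cost of each trial, changes.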
Notes
Chapter 1

1. In this book, we take the term reactive agent to be a synonym of stimulus-response (S-R) agent. With this definition, we do not consider any agent whose controller is dynamic (i.e., has an internal state) to be reactive. It is also common to define a reactive agent as the opposite of a deliberative agent, which is able to plan its actions according to an explicit representation of its goals. Our use of the term reactive is more restrictive. For example, we do not consider an agent with some memory of the past to be reactive, even when the agent is not deliberative (see section 4.2.3 and chapter 6).

Chapter 2

1. Other well-known implementations of Holland's LCS are Booker 1988; Booker, Goldberg, and Holland 1989; Robertson and Riolo 1988; and Wilson 1994.

2. The bucket brigade algorithm belongs to the class of temporal difference algorithms (Sutton 1988) and bears much resemblance to Watkins's Q-learning, as discussed, for example, by Dorigo and Bersini (1994) and by Wilson (1994). Most recent research on LCSs has shown that the use of a version of the bucket brigade closer to Q-learning has beneficial effects on LCS learning performance (see, for example, Wilson 1994).

3. A solution is also called an "individual," and in LCSs an individual is a "classifier" (see terminology box 2.2).

4. The GA used in the LCS is a panmictic GA, that is, parent classifiers are chosen among all the classifiers in the population. This approach has been recently criticized by Wilson (1996; but see also Booker 1982), who proposes to recombine only classifiers belonging to the same action set. The rationale here is that classifiers in a same action set are likely to be relevant to the same or similar problems, and therefore recombination among them could have a higher probability of generating new useful classifiers. Using this restrictive mating for the GA, together with a different way to evaluate classifier usefulness based on accuracy instead of strength, Wilson has recently designed XCS, which is probably, up to now, the best-performing LCS (Wilson 1995, 1996).

5. It could seem strange to call a sum of strengths "energy." In point of fact, we adopted "energy" because the defined dimension has some of the properties of an energy (e.g., it is additive with respect to subsystems). On the other hand, we chose not to change the well-established usage of "strength" to indicate the utility of a classifier.

6. ALECSYS is implemented in parallel C and Express, and runs on Quintet boards inserted in PCs (ATs or better). The simulation environment is written in C and runs on the host computer (a PC).

Chapter 4

1. Questions about representation, rule format, and so on will be discussed in section 4.3.

2. Remember that we restricted the visual field of the AutonoMouse to two overlapping cones of 60 degrees, as opposed to the standard 180-degree overlapping cones, to make the light exit the visual field more frequently.

3. The AutonoMouse's performance has been decomposed by computing the ratio (number of times the AutonoMouse is rewarded when the light is not visible, divided by the number of times the light is not visible) for figure 4.5, and analogously for figure 4.6. The two performance graphs have then been interpolated and plotted with respect to the actual cycle number to obtain comparable graphs.

4. Because motor messages are six bits long, this implies that three bits in the action string are useless.

5. Rules in LCS-chase are 15 bits long, in LCS-hide 18 bits long, and in LCS-switch 6 bits long.

Chapter 5

1. If there is correct mapping, we say that ALECSYS and the AutonoMouse are "well calibrated."

2. We call "lesions" malfunctions that make the behavior of a sensor or motor systematically differ from design specifications, as distinguished from "noise," which in our experiments could be modeled as a Gaussian distribution around a mean behavior that meets the design specifications.

3. In this and in the following experiments, 100 cycles took about 60 seconds.

4. For example, the experimenter could keep the light at the right side of the AutonoMouse II to teach the robot to turn right.

5. In this experiment the performance index was given by the light intensity seen by the frontal eyes when the AutonoMouse II had to move forward, and by the light intensity seen by the rear eyes when the robot had to move backward.

Chapter 7

1. In general, further aspects of quality will be involved in any practical application.

2. The difference between adaptiveness and robustness can be stated as follows: an agent is adaptive if its controller is capable of changing its characteristics online in such a way as to make its behavior better matched to a changing environment, while an agent is robust when its controller is able to maintain a reasonable level of performance in a changing environment without changing its own structure.

3. The name HAMSTER (Highly Autonomous Mobile System Trained by Reinforcements) is mainly justified by the robot's target behavior.

Chapter 8

1. See, for example, the entry ADAPTATION in The Oxford Companion to Animal Behaviour (McFarland 1981).

2. We prefer the term nonrational design to irrational design because in everyday language the latter has a negative connotation, while the former is more neutral.
References
Agre, P. E., and Chapman, D. 1987. Pengi: An implementation of a theory of activity. In Proceedings of the Sixth National Conference on Artificial Intelligence (AAAI-87). Morgan Kaufmann.
Arkin, R. C. 1990. Integrating behavioral, perceptual, and world knowledge in reactive navigation. In P. Maes, ed., Special issue on designing autonomous agents. Robotics and Autonomous Systems 6 (1-2): 105-122.
Asada, M., Noda, S., Tawaratsumida, S., and Hosoda, K. 1996. Purposive behavior acquisition for a real robot by vision-based reinforcement learning. Machine Learning 23 (2-3): 279-303.
Barto, A. G., Bradtke, S. J., and Singh, S. P. 1995. Learning to act using real-time dynamic programming. Artificial Intelligence 72 (1-2): 81-138.
Barto, A. G., Sutton, R. S., and Anderson, C. W. 1983. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics 13 (5): 834-846.
Beer, R. D. 1995. A dynamical systems perspective on autonomous agents. Artificial Intelligence 72 (1-2): 173-215.
Beer, R. D., and Gallagher, J. C. 1992. Evolving dynamical neural networks for adaptive behavior. Adaptive Behavior 1 (1): 91-122.
Belew, R. K., McInerney, J., and Schraudolph, N. N. 1991. Evolving networks: Using the genetic algorithm with connectionist learning. In Proceedings of the Second International Conference on Artificial Life. Addison-Wesley.
Bellman, R. 1957. Dynamic programming. Princeton University Press.
Bertoni, A., and Dorigo, M. 1993. Implicit parallelism in genetic algorithms. Artificial Intelligence 61 (2): 307-314.
Bertsekas, D. 1987. Dynamic programming: Deterministic and stochastic models. Prentice-Hall.
Bonarini, A. 1996. Learning dynamic fuzzy behaviors from easy missions. In Proceedings of IPMU '96. Proyecto Sur de Ediciones, Granada.
Booker, L. B. 1982. Intelligent behavior as an adaptation to the task environment. Ph.D. diss., University of Michigan, Ann Arbor.
Booker, L. B. 1988. Classifier systems that learn internal world models. Machine Learning 3 (3): 161-192.
Booker, L. B., Goldberg, D. E., and Holland, J. H. 1989. Classifier systems and genetic algorithms. Artificial Intelligence 40 (1-3): 235-282.
Brooks, R. A. 1990. Elephants don't play chess. In P. Maes, ed., Special issue on designing autonomous agents. Robotics and Autonomous Systems 6 (1-2): 3-16.
Brooks, R. A. 1991. Intelligence without representation. Artificial Intelligence 47 (1-3): 139-159.
Brownston, L., Farrell, F., Kant, E., and Martin, N. 1985. Programming expert systems in OPS5: An introduction to rule-based programming. Addison-Wesley.
Caironi, P. V. C., and Dorigo, M. 1994. Training Q-agents. Technical Report IRIDIA/94-14. Université Libre de Bruxelles (Brussels).
Camilli, A., Di Meglio, R., Baiardi, F., Vanneschi, M., Montanari, D., and Serra, R. 1990. Parallelizzazione di sistemi a classificatori su architetture MIMD. Progetto finalizzato sistemi informatici e calcolo parallelo, Sottoprogetto 3 (architetture parallele), Technical Report 3-17, CNR, Italy.
Cliff, D., and Bullock, S. 1993. Adding "foveal vision" to Wilson's Animat. Adaptive Behavior 2 (1): 49-72.
Cliff, D., Harvey, I., and Husbands, P. 1993. Explorations in evolutionary robotics. Adaptive Behavior 2 (1): 73-110.
Cliff, D., and Ross, S. 1994. Adding temporary memory to ZCS. Adaptive Behavior 3 (2): 101-150.
Clouse, J. A., and Utgoff, P. E. 1992. A teaching method for reinforcement learning. In Proceedings of the Ninth Conference on Machine Learning. Morgan Kaufmann.
Cohen, P. R. 1995. Empirical methods for artificial intelligence. MIT Press.
Colombetti, M., and Dorigo, M. 1993. Learning to control an autonomous robot by distributed genetic algorithms. In J.-A. Meyer, H. Roitblat, and S. W. Wilson, eds., From Animals to Animats 2: Proceedings of the Second International Conference on Simulation of Adaptive Behavior (SAB92). MIT Press.
Compiani, M., Montanari, D., Serra, R., and Valastro, G. 1989. Classifier systems and neural networks. In E. R. Caianiello, ed., Parallel architectures and neural networks. World Scientific.
Cramer, N. L. 1985. A representation for the adaptive generation of simple sequential programs. In Proceedings of the First International Conference on Genetic Algorithms and Their Applications. Erlbaum.
Dickmanns, D., Schmidhuber, J., and Winklhofer, A. 1987. Der genetische Algorithmus: Eine Implementierung in Prolog. Fortgeschrittenenpraktikum, Institut für Informatik, Lehrstuhl Prof. Radig, Technische Universität, Munich.
Donnart, J.-Y., and Meyer, J.-A. 1996. Learning reactive and planning rules in a motivationally autonomous animat. In M. Dorigo, ed., Special issue on learning autonomous robots. IEEE Transactions on Systems, Man, and Cybernetics Part B 26 (3): 381-395.
Dorigo, F., and Maesani, A. 1993. Development of a mobile robot: Integration of machine learning and design techniques (in Italian). Master's thesis, Dipartimento di Elettronica e Informazione, Politecnico di Milano (Milan).
Dorigo, M. 1992. Using transputers to increase speed and flexibility of genetics-based machine learning systems. Microprocessing and Microprogramming Journal 34: 147-152.
Dorigo, M. 1993. Genetic and non-genetic operators in ALECSYS. Evolutionary Computation 1 (2): 151-164.
Dorigo, M. 1995. ALECSYS and the AutonoMouse: Learning to control a real robot by distributed classifier systems. Machine Learning 19 (3): 209-240.
Dorigo, M., ed. 1996. Special issue on learning autonomous robots. IEEE Transactions on Systems, Man, and Cybernetics Part B 26 (3): 361-364.
Dorigo, M., and Bersini, H. 1994. A comparison of Q-learning and classifier systems. In D. Cliff, P. Husbands, J. B. Pollack, and S. W. Wilson, eds., From Animals to Animats 3: Proceedings of the Third International Conference on Simulation of Adaptive Behavior (SAB94). MIT Press.
Dorigo, M., and Colombetti, M. 1994a. Robot shaping: Developing autonomous agents through learning. Artificial Intelligence 71 (2): 321-370.
Dorigo, M., and Colombetti, M. 1994b. The role of the trainer in reinforcement learning. In Proceedings of the MLC-COLT '94 Workshop on Robot Learning, New Brunswick, NJ. Available on Internet at http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchen/ftp/robot-learning.html
Dorigo, M., and Schnepf, U. 1991. Organisation of robot behaviour through genetic learning processes. In Proceedings of the Fifth IEEE International Conference on Advanced Robotics (ICAR '91). IEEE Press.
Dorigo, M., and Schnepf, U. 1993. Genetics-based machine learning and behavior-based robotics: A new synthesis. IEEE Transactions on Systems, Man, and Cybernetics 23 (1): 141-154.
Dorigo, M., and Sirtori, E. 1991. ALECSYS: A parallel laboratory for learning classifier systems. In Proceedings of the Fourth International Conference on Genetic Algorithms. Morgan Kaufmann.
Dorigo, M., Patel, M. J., and Colombetti, M. 1994. The effect of sensory information on reinforcement learning by a robot arm. In Proceedings of the Fifth International Symposium on Robotics and Manufacturing (ISRAM '94). ASME Press.
Dress, W. B. 1987. Darwinian optimization of synthetic neural systems. In Proceedings of the First IEEE International Conference on Neural Networks. IEEE Press.
Floreano, D., and Mondada, F. 1996. Evolution of homing navigation in a real mobile robot. In M. Dorigo, ed., Special issue on learning autonomous robots. IEEE Transactions on Systems, Man, and Cybernetics Part B 26 (3): 396-407.
Flynn, M. J. 1972. Some computer organizations and their effectiveness. IEEE Transactions on Computers 21 (9): 948-960.
Fodor, J. 1985. The modularity of mind. Focal article with commentary. Behavioral and Brain Sciences 8: 1-42.
Fogel, D. B., Fogel, L. J., and Porto, V. W. 1990. Evolving neural networks. Biological Cybernetics 63 (6): 487-493.
Franklin, J., Mitchell, T., and Thrun, S., eds. 1996. Special issue on robot learning. Machine Learning 23 (2-3).
Fujiki, C., and Dickinson, J. 1987. Using the genetic algorithm to generate Lisp source code to solve the Prisoner's dilemma. In Proceedings of the Second International Conference on Genetic Algorithms. Erlbaum.
Gaussier, P., ed. 1995. Special issue on moving the frontiers between robotics and biology. Robotics and Autonomous Systems 16 (2-4).
Ghezzi, C., Jazayeri, M., and Mandrioli, D. 1991. Fundamentals of software engineering. Prentice-Hall.
Goldberg, D. E. 1989. Genetic algorithms in search, optimization and machine learning. Addison-Wesley.
Gordon, D., and Subramanian, D. 1994. A multistrategy learning scheme for agent knowledge acquisition. Informatica 17: 331-346.
Grefenstette, J. J., and Ramsey, C. L. 1992. An approach to anytime learning. In Proceedings of the Ninth International Machine Learning Conference. Morgan Kaufmann.
Grefenstette, J. J., Ramsey, C. L., and Schultz, A. C. 1990. Learning sequential decision rules using simulation models and competition. Machine Learning 5 (4): 355-381.
Grefenstette, J. J., and Schultz, A. C. 1994. An evolutionary approach to learning in robots. In Proceedings of the MLC-COLT '94 Workshop on Robot Learning, New Brunswick, NJ.
Harvey, I., Husbands, P., and Cliff, D. 1994. Seeing the light: Artificial evolution, real vision. In D. Cliff, P. Husbands, J. B. Pollack, and S. W. Wilson, eds., From Animals to Animats 3: Proceedings of the Third International Conference on Simulation of Adaptive Behavior (SAB94). MIT Press.
Hicklin, J. F. 1986. Application of the genetic algorithm to automatic program generation. Master's thesis, University of Idaho.
Higuchi, T., Iwata, M., Kajitani, I., Iba, H., Hirao, Y., Furuya, T., and Manderick, B. 1996. Evolvable hardware and its application to pattern recognition and fault-tolerant systems. In E. Sanchez and M. Tomassini, eds., Towards Evolvable Hardware: An Evolutionary Engineering Approach. Lecture Notes in Computer Science 1062. Springer.
Hillis, W. D. 1985. The connection machine. MIT Press.
Holland, J. H. 1975. Adaptation in natural and artificial systems. University of Michigan Press. Reprint, MIT Press, 1992.
Holland, J. H. 1980. Adaptive algorithms for discovering and using general patterns in growing knowledge bases. International Journal of Policy Analysis and Information Systems 4 (2): 217-240.
Holland, J. H. 1986. Escaping brittleness: The possibilities of general purpose learning algorithms applied to parallel rule-based systems. In R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, eds., Machine Learning II. Morgan Kaufmann.
Holland, J. H., Holyoak, K. J., Nisbett, R. E., and Thagard, P. R. 1986. Induction: Processes of inference, learning, and discovery. MIT Press.
Holland, J. H., and Reitman, J. S. 1978. Cognitive systems based on adaptive algorithms. In D. A. Waterman and F. Hayes-Roth, eds., Pattern-directed inference systems. Academic Press.
INMOS. 1989. The transputer handbook. 2d ed.
Jakobi, N., Husbands, P., and Harvey, I. 1995. Noise and the reality gap: The use of simulation in evolutionary robotics. In Proceedings of the Third International Conference on Artificial Life. Springer.
Kaelbling, L. P. 1987. An architecture for intelligent reactive systems. In M. P. Georgeff and A. L. Lansky, eds., Reasoning about actions and plans. Morgan Kaufmann.
Kaelbling, L. P. 1990. Learning in embedded systems. Ph.D. diss., Stanford University, Stanford, CA.
Kaelbling, L. P. 1991. An adaptable mobile robot. In Proceedings of the First European Conference on Artificial Life. MIT Press.
Kaelbling, L. P., Littman, M. L., and Moore, A. W. 1996. Reinforcement learning: A survey. Journal of Artificial Intelligence Research 4: 237-285.
Kaelbling, L. P., and Rosenschein, S. J. 1991. Action and planning in embedded agents. In P. Maes, ed., Special issue on designing autonomous agents. Robotics and Autonomous Systems 6 (1-2): 35-48.
Kitano, H. 1990. Designing neural networks using genetic algorithms with graph generation system. Complex Systems 4: 461-476.
Koza, J. R. 1992. Genetic programming. MIT Press.
Koza, J. R., and Rice, J. P. 1992. Automatic programming of robots using genetic programming. In Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92). AAAI Press.
Kröse, B. J. A., ed. 1995. Special issue on reinforcement learning and robotics. Robotics and Autonomous Systems 15 (4).
Latombe, J. C. 1991. Robot motion planning. Kluwer.
Lin, L.-J. 1992. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8 (3-4): 293-322.
Lin, L.-J. 1993a. Hierarchical learning of robot skills by reinforcement. In Proceedings of the 1993 IEEE International Conference on Neural Networks. IEEE Press.
Lin, L.-J. 1993b. Scaling up reinforcement learning for robot control. In Proceedings of the Tenth International Conference on Machine Learning. Morgan Kaufmann.
Lin, L.-J., and Mitchell, T. M. 1992. Memory approaches to reinforcement learning in non-Markovian domains. Technical Report CMU-CS-92-138. School of Computer Science, Carnegie Mellon University, Pittsburgh.
Littman, M. L. 1993. An optimization-based categorization of reinforcement learning environments. In J.-A. Meyer, H. Roitblat, and S. W. Wilson, eds., From Animals to Animats 2: Proceedings of the Second International Conference on Simulation of Adaptive Behavior (SAB92). MIT Press.
Maclin, R., and Shavlik, J. W. 1996. Creating advice-taking reinforcement learners. Machine Learning 22 (1-3): 251-281.
Maes, P. 1990. A bottom-up mechanism for behavior selection in an artificial creature. In From Animals to Animats: Proceedings of the First International Conference on Simulation of Adaptive Behavior (SAB90). MIT Press.
Maes, P. 1994. Modeling adaptive autonomous agents. Artificial Life 1 (1-2): 135-162.
Maes, P., and Brooks, R. A. 1990. Learning to coordinate behaviors. In Proceedings of the Eighth National Conference on Artificial Intelligence (AAAI-90). AAAI Press.
Mahadevan, S., and Connell, J. 1992. Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence 55 (2): 311-365.
Maniezzo, V. 1994. Genetic evolution of the topology and weight distribution of neural networks. IEEE Transactions on Neural Networks 5 (1): 39-53.
Mataric, M. J., and Cliff, D. 1996. Challenges in evolving controllers for physical robots. Special issue on evolutionary robotics. Robotics and Autonomous Systems 19 (1): 67-83.
McFarland, D., ed. 1981. The Oxford Companion to Animal Behaviour. Oxford University Press.
Meeden, L. A. 1996. An incremental approach to developing intelligent neural network controllers for robots. In M. Dorigo, ed., Special issue on learning autonomous robots. IEEE Transactions on Systems, Man, and Cybernetics Part B 26 (3): 474-484.
Michalewicz, Z. 1992. Genetic algorithms + data structures = evolution programs. Springer.
Millán, J. del R. 1996. Rapid, safe, and incremental learning of navigation strategies. In M. Dorigo, ed., Special issue on learning autonomous robots. IEEE Transactions on Systems, Man, and Cybernetics Part B 26 (3): 408-420.
Millán, J. del R., and Torras, C. 1992. A reinforcement connectionist approach to robot path finding in non-maze-like environments. Machine Learning 8 (3-4): 363-395.
Mitchell, M. 1996. An introduction to genetic algorithms. MIT Press.
Moriarty, D. E., and Miikkulainen, R. 1996. Efficient reinforcement learning through symbiotic evolution. Machine Learning 22 (1-3): 11-32.
Nehmzow, U., and McGonigle, B. 1994. Achieving rapid adaptations in robots by means of external tuition. In D. Cliff, P. Husbands, J. B. Pollack, and S. W. Wilson, eds., From Animals to Animats 3: Proceedings of the Third International Conference on Simulation of Adaptive Behavior (SAB94). MIT Press.
Nilsson, M., and Ojala, J. 1995. "Self-awareness" in reinforcement learning of snake-like robot locomotion. In Proceedings of IASTED 95, Cancun. Acta Press.
Nolfi, S., Floreano, D., Miglino, O., and Mondada, F. 1994. How to evolve autonomous robots: Different approaches in evolutionary robotics. In Proceedings of the Fourth International Workshop on Artificial Life. MIT Press.
Nolfi, S., and Parisi, D. 1992. Growing neural networks. In Proceedings of the Third International Conference on Artificial Life. Addison-Wesley.
Nordin, P., and Banzhaf, W. 1996. A genetic programming system learning obstacle avoiding behavior and controlling a miniature robot in real time. Technical Report 4/95. Department of Computer Science, University of Dortmund.
Patel, M. J., Colombetti, M., and Dorigo, M. 1995. Evolutionary learning for intelligent automation: A case study. Intelligent Automation and Soft Computing Journal 1 (1): 29-42.
Patel, M. J., and Dorigo, M. 1994. Adaptive learning of a robot arm. In Proceedings of Evolutionary Computing, AISB Workshop (Selected Papers). Springer.
Piroddi, R., and Rusconi, R. 1992. A parallel classifier system to solve learning problems (in Italian). Master's thesis, Dipartimento di Elettronica e Informazione, Politecnico di Milano (Milan).
Puterman, M. 1994. Markov decision processes: Discrete stochastic dynamic programming. Wiley.
Riolo, R. L. 1987. Bucket brigade performance: 2. Default hierarchies. In Proceedings of the Second International Conference on Genetic Algorithms. Erlbaum.
Riolo, R. L. 1989. The emergence of default hierarchies in learning classifier systems. In Proceedings of the Third International Conference on Genetic Algorithms. Morgan Kaufmann.
Robertson, G. G. 1987. Parallel implementation of genetic algorithms in a classifier system. In Proceedings of the Second International Conference on Genetic Algorithms. Erlbaum.
Robertson, G. G., and Riolo, R. L. 1988. A tale of two classifier systems. Machine Learning 3 (2-3): 139-160.
Rosenschein, S. J., and Kaelbling, L. P. 1986. The synthesis of digital machines with provable epistemic properties. In Proceedings of the 1986 Conference on Theoretical Aspects of Reasoning about Knowledge. Morgan Kaufmann.
Ross, S. 1983. Introduction to stochastic dynamic programming. Academic Press.
Saffiotti, A., Konolige, K., and Ruspini, E. H. 1995. A multivalued logic approach to integrating planning and control. Artificial Intelligence 76 (1-2): 481-526.
Saffiotti, A., Ruspini, E. H., and Konolige, K. 1993. Integrating reactivity and goal-directedness in a fuzzy controller. In Proceedings of the Second Fuzzy-IEEE Conference. IEEE Press.
Schaffer, J. D., Whitley, D., and Eshelman, L. J. 1992. Combinations of genetic algorithms and neural networks: A survey of the state of the art. In Proceedings of the International Workshop on Combinations of Genetic Algorithms and Neural Networks (COGANN-92). IEEE Press.
Shepanski, J. F., and Macy, S. A. 1987. Manual training techniques of autonomous systems based on artificial neural networks. In Proceedings of the First IEEE Annual International Conference on Neural Networks. IEEE Press.
Siegel, S., and Castellan, N. J. 1988. Nonparametric statistics for the behavioral sciences. 2d ed. McGraw-Hill.
Singh, S. P. 1992. Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning 8 (3-4): 323-339.
Skinner, B. F. 1938. The behavior of organisms: An experimental analysis. Appleton-Century.
Smithers, T. 1995. On quantitative performance measures of robot behaviour. In L. Steels, ed., Special issue on the biology and technology of intelligent and autonomous systems. Robotics and Autonomous Systems 15 (1-2): 107-134.
Sommerville, I. 1989. Software engineering. Addison-Wesley.
Steels, L. 1990. Cooperation between distributed agents through self-organization. In Y. Demazeau and J.-P. Müller, eds., Decentralized AI. North-Holland.
Steels, L. 1994. The artificial life roots of artificial intelligence. Artificial Life 1 (1-2): 75-110.
Steels, L., ed. 1995. Special issue on the biology and technology of intelligent and autonomous systems. Robotics and Autonomous Systems 15 (1-2).
Sutton, R. S. 1984. Temporal credit assignment in reinforcement learning. Ph.D. diss., University of Massachusetts, Amherst.
Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3 (1): 9-44.
Sutton, R. S. 1990. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning. Morgan Kaufmann.
Sutton, R. S. 1991. Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bulletin 2: 160-163.
Sutton, R. S., Barto, A. G., and Williams, R. J. 1991. Reinforcement learning is direct adaptive optimal control. In Proceedings of the 1991 American Control Conference. American Control Council.
Tani, J. 1996. Model-based learning for mobile robot navigation from the dynamical systems perspective. In M. Dorigo, ed., Special issue on learning autonomous robots. IEEE Transactions on Systems, Man, and Cybernetics Part B 26 (3): 421-436.
Thrun, S., and Mitchell, T. 1993. Integrating inductive neural network learning and explanation-based learning. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (IJCAI-93). Morgan Kaufmann.
Utgoff, P., and Clouse, J. 1991. Two kinds of training information for evaluation function learning. In Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI-91). AAAI Press.
Watkins, C. J. C. H. 1989. Learning from delayed rewards. Ph.D. diss., University of Cambridge.
Watkins, C. J. C. H., and Dayan, P. 1992. Technical note: Q-learning. Machine Learning 8 (3-4): 279-292.
Whitehead, S. D. 1991a. A study of cooperative mechanisms for faster reinforcement learning. Technical Report TR 365. Computer Science Department, University of Rochester.
Whitehead, S. D. 1991b. A complexity analysis of cooperative mechanisms in reinforcement learning. In Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI-91). AAAI Press.
Whitehead, S. D., and Ballard, D. H. 1991. Learning to perceive and act by trial and error. Machine Learning 7 (1): 45-83.
Whitehead, S. D., and Lin, L.-J. 1995. Reinforcement learning in non-Markov decision processes. Artificial Intelligence 73 (1-2): 271-306.
Whitley, D., Starkweather, T., and Bogart, C. 1990. Genetic algorithms and neural networks: Optimizing connections and connectivity. Parallel Computing 14: 347-361.
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4): 229-256.
Wilson, S. W. 1985. Knowledge growth in an artificial animal. In Proceedings of the First International Conference on Genetic Algorithms and Their Applications. Erlbaum.
Wilson, S. W. 1987. Classifier systems and the Animat problem. Machine Learning 2 (3): 199-228.
Wilson, S. W. 1990. The Animat path to AI. In From Animals to Animats: Proceedings of the First International Conference on the Simulation of Adaptive Behavior (SAB90). MIT Press.
Wilson, S. W. 1994. ZCS: A zeroth level classifier system. Evolutionary Computation 2 (1): 1-18.
Wilson, S. W. 1995. Classifier fitness based on accuracy. Evolutionary Computation 3 (2): 149-176.
Wilson, S. W. 1996. Generalization in XCS. Unpublished contribution to the ICML '96 Workshop on Evolutionary Computation and Machine Learning, Bari, Italy. Available on Internet at http://netq.rowland.org/sw/swhp.html