Accelerating Reinforcement Learning through Implicit Imitation

Imitation can be viewed as a means of enhancing learning in multiagent environments. It augments an agent's ability to learn useful behaviors by making intelligent use of the knowledge implicit in behaviors demonstrated by cooperative teachers or other more experienced agents. We propose and study a formal model of implicit imitation that can accelerate reinforcement learning dramatically in certain cases. Roughly, by observing a mentor, a reinforcement-learning agent can extract information about its own capabilities in, and the relative value of, unvisited parts of the state space. We study two specific instantiations of this model: one in which the learning agent and the mentor have identical abilities, and one designed to deal with agents and mentors that have different action sets. We illustrate the benefits of implicit imitation by integrating it with prioritized sweeping and demonstrating improved performance and convergence through observation of single and multiple mentors. Though we make some stringent assumptions regarding observability and possible interactions, we briefly comment on extensions of the model that relax these restrictions.
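
The core mechanism lends itself to a compact sketch. Below is a minimal tabular illustration of the homogeneous-action case described in the abstract: the observer maintains an empirical model of its own transitions and of the mentor's observed state-to-state transitions, and its Bellman backup takes the better of the value achievable under its own best action and the value implied by the mentor's dynamics. This is a sketch under stated assumptions, not the paper's implementation; all class, method, and parameter names are illustrative, and the prioritized-sweeping machinery (priority-queue management, exploration policy) is elided.

```python
from collections import defaultdict

class ImplicitImitator:
    """Illustrative tabular learner with an augmented Bellman backup.

    The observer keeps visit counts for its own experienced transitions
    and for the mentor's observed state-to-state transitions, then backs
    up each state using the larger of its own best action-value estimate
    and the value implied by the mentor's empirical dynamics.
    """

    def __init__(self, actions, gamma=0.95):
        self.actions = actions
        self.gamma = gamma
        self.V = defaultdict(float)                           # value estimates
        self.R = defaultdict(float)                           # observed rewards
        self.own = defaultdict(lambda: defaultdict(int))      # (s, a) -> {s': count}
        self.mentor = defaultdict(lambda: defaultdict(int))   # s -> {s': count}

    def observe_self(self, s, a, r, s2):
        """Record one of the agent's own transitions."""
        self.R[s] = r
        self.own[(s, a)][s2] += 1

    def observe_mentor(self, s, s2):
        """Record an observed mentor transition (actions unobserved)."""
        self.mentor[s][s2] += 1

    def _expected_value(self, counts):
        """Expected next-state value under the empirical distribution."""
        total = sum(counts.values())
        if total == 0:
            return None
        return sum(n / total * self.V[s2] for s2, n in counts.items())

    def backup(self, s):
        """Augmented backup: max over own model and mentor-derived model."""
        own_vals = (self._expected_value(self.own[(s, a)]) for a in self.actions)
        own_best = max((v for v in own_vals if v is not None),
                       default=float('-inf'))
        mentor_val = self._expected_value(self.mentor[s])
        best = max(own_best,
                   mentor_val if mentor_val is not None else float('-inf'))
        if best > float('-inf'):
            self.V[s] = self.R[s] + self.gamma * best
```

In a full prioritized-sweeping integration, this augmented backup would simply replace the standard one, with the magnitude of each value change used to reprioritize backups of predecessor states.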
