Multi-timescale nexting in a reinforcement learning robot

The term ‘nexting’ has been used by psychologists to refer to the propensity of people and many other animals to continually predict what will happen next in an immediate, local, and personal sense. The ability to ‘next’ constitutes a basic kind of awareness and knowledge of one’s environment. In this paper we present results with a robot that learns to next in real time, making thousands of predictions about sensory input signals at timescales from 0.1 to 8 seconds. Our predictions are formulated as a generalization of the value functions commonly used in reinforcement learning, where now an arbitrary function of the sensory input signals is used as a pseudo reward, and the discount rate determines the timescale. We show that six thousand predictions, each computed as a function of six thousand features of the state, can be learned and updated online ten times per second on a laptop computer, using the standard TD(λ) algorithm with linear function approximation. This approach is sufficiently computationally efficient to be used for real-time learning on the robot and sufficiently data efficient to achieve substantial accuracy within 30 minutes. Moreover, a single tile-coded feature representation suffices to accurately predict many different signals over a significant range of timescales. We also extend nexting beyond simple timescales by letting the discount rate be a function of the state, and we show that nexting predictions of this more general form can also be learned with substantial accuracy. General nexting provides a simple yet powerful mechanism for a robot to acquire predictive knowledge of the dynamics of its environment.
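To make the approach concrete, the sketch below shows the kind of learner the abstract describes: each nexting prediction is an independent TD(λ) learner with linear function approximation over a shared feature vector, with its own pseudo-reward signal and a discount rate that fixes its timescale. This is a minimal illustration under stated assumptions, not the authors' implementation; the class name, parameter values, and feature dimension are all illustrative.

```python
# Minimal sketch of multi-timescale nexting with TD(lambda) and linear
# function approximation. All names and constants here are illustrative
# assumptions, not the paper's code.
import numpy as np

class NextingPredictor:
    """One nexting prediction: a generalized value function whose
    pseudo-reward is an arbitrary function of the sensory input and whose
    discount rate gamma sets the timescale (~ 1/(1 - gamma) time steps)."""

    def __init__(self, n_features, gamma, lam=0.9, alpha=0.1):
        self.w = np.zeros(n_features)   # learned weight vector
        self.e = np.zeros(n_features)   # eligibility trace
        self.gamma = gamma              # discount rate (sets the timescale)
        self.lam = lam                  # trace-decay parameter lambda
        self.alpha = alpha              # step size

    def predict(self, phi):
        # Linear prediction: inner product of weights and features.
        return self.w @ phi

    def update(self, phi, pseudo_reward, phi_next):
        # Standard TD(lambda) update with accumulating traces.
        delta = pseudo_reward + self.gamma * (self.w @ phi_next) - self.w @ phi
        self.e = self.gamma * self.lam * self.e + phi
        self.w += self.alpha * delta * self.e


# At 10 updates per second, gamma = 1 - 1/(10*T) targets a timescale of
# roughly T seconds, e.g. T = 0.1, 1, and 8 seconds as in the abstract.
timescales = [0.1, 1.0, 8.0]
n_features = 6000
predictors = [NextingPredictor(n_features, gamma=1 - 1 / (10 * T))
              for T in timescales]

# One learning step: phi and phi_next are the (shared) tile-coded feature
# vectors for successive sensory readings; each predictor supplies its own
# pseudo_reward, e.g. the current value of the signal it is predicting.
```

Because all predictors share one tile-coded feature vector, the features are computed once per time step and reused across all predictions, which is what makes thousands of simultaneous updates feasible in real time. The extension to general nexting amounts to replacing the constant gamma with a state-dependent function: in the standard generalized-value-function formulation, delta uses the discount evaluated at the next state and the trace decay uses the discount evaluated at the current state.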
