Explorations in efficient reinforcement learning

This thesis describes reinforcement learning (RL) methods that solve sequential decision-making problems by learning from trial and error. In a sequential decision-making problem, an artificial agent interacts with an environment through its sensors (to receive inputs) and effectors (to perform actions). To measure the quality of the agent's behavior, a reward function determines how much the agent is rewarded or penalized for performing particular actions in particular environmental states. The goal is to find an action-selection policy for the agent that maximizes the cumulative reward collected in the future. In RL, an agent's policy maps sensor-based inputs to actions. To evaluate a policy, a value function is learned that returns, for each possible state, the future cumulative reward collected by following the current policy. Given such a value function, the agent can simply select the action with the largest estimated value. To learn a value function for a specific problem, RL methods simulate a policy and learn from the resulting experiences, which consist of (state, action, reward, next-state) quadruples. There are different classes of RL problems and different RL methods for solving them; we describe several such classes and introduce new methods for solving them.

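To make this concrete, the following is a minimal sketch of how one well-known RL method, tabular Q-learning, updates a value function from (state, action, reward, next-state) quadruples gathered by simulating a policy. The corridor environment, the constants, and the `step` helper are hypothetical illustrations, not part of the thesis; they merely show the general learning loop described above.

```python
import random

# Hypothetical toy problem: a 1-D corridor of N states; the agent moves
# left or right and is rewarded only on reaching the rightmost state.
N_STATES = 6
ACTIONS = [0, 1]          # 0 = move left, 1 = move right
ALPHA = 0.1               # learning rate
GAMMA = 0.9               # discount factor for future rewards
EPSILON = 0.1             # exploration probability

# Tabular value function: Q[state][action] estimates future cumulative reward.
Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]

def step(state, action):
    """Assumed environment dynamics: return (reward, next_state)."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return reward, next_state

for episode in range(500):
    state = 0
    while state != N_STATES - 1:
        # Epsilon-greedy policy: usually select the action with the largest
        # estimated value, occasionally explore at random.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[state][a])
        reward, next_state = step(state, action)
        # Update the value estimate from one (s, a, r, s') quadruple.
        target = reward + GAMMA * max(Q[next_state])
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = next_state
```

After training, reading out `max(Q[s])` for each state `s` gives the learned value function, and acting greedily with respect to it yields the improved policy.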