Reinforcement Learning Through Gradient Descent

Abstract: Reinforcement learning is often done using parameterized function approximators to store value functions. Algorithms are typically developed for lookup tables and then applied to function approximators via backpropagation. This can cause the algorithms to diverge even on very small, simple MDPs and Markov chains, even with linear function approximators and epoch-wise training. Such algorithms are also difficult to analyze and difficult to combine with other algorithms. A series of new families of algorithms is derived based on stochastic gradient descent. Because they are derived from first principles with function approximators in mind, they have guaranteed convergence to local minima, even with general nonlinear function approximators. For both residual algorithms and VAPS algorithms, it is possible to take any of the standard algorithms in the field, such as Q-learning, SARSA, or value iteration, and rederive a new form of it with provable convergence. Beyond better convergence properties, it is shown how gradient descent allows an inelegant, inconvenient algorithm like Advantage updating to be converted into a much simpler and more easily analyzed algorithm like Advantage learning. In this case that is very useful, since Advantages can be learned thousands of times faster than Q-values for continuous-time problems, so there are significant practical benefits to using gradient-descent-based techniques. In addition to improving both the theory and practice of existing types of algorithms, the gradient-descent approach makes it possible to create entirely new classes of reinforcement learning algorithms. VAPS algorithms can be derived that ignore values altogether and simply learn good policies directly. One hallmark of gradient descent is the ease with which different algorithms can be combined, and this is a prime example.

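The abstract's core idea, treating value-function updates as true stochastic gradient descent on the Bellman residual, can be illustrated with a small sketch. The following is a minimal, hypothetical residual-gradient Q-learning step for a linear function approximator; the function name, signature, and feature-vector representation are assumptions for illustration and not the dissertation's exact algorithms.

```python
# Hypothetical sketch of a residual-gradient Q-learning step with a linear
# approximator Q(s, a) = w . phi(s, a): one stochastic gradient-descent step
# on the squared Bellman residual, differentiating through Q(s', a') as well.
import numpy as np

def residual_gradient_update(w, phi_sa, phi_next, r, gamma, alpha, terminal=False):
    """One descent step on 0.5 * (r + gamma * max_a' Q(s', a') - Q(s, a))^2.

    w        : weight vector of the linear approximator
    phi_sa   : feature vector phi(s, a) for the visited state-action pair
    phi_next : list of feature vectors phi(s', a'), one per action a' in s'
    """
    q_sa = w @ phi_sa
    if terminal:
        target = r
        grad_target = np.zeros_like(w)
    else:
        q_next = [w @ f for f in phi_next]
        best = int(np.argmax(q_next))
        target = r + gamma * q_next[best]
        grad_target = gamma * phi_next[best]   # residual methods differentiate the target too
    delta = target - q_sa                      # Bellman residual
    # Gradient of the squared residual w.r.t. w is delta * (grad_target - phi_sa),
    # so descending it gives the update below.
    w += alpha * delta * (phi_sa - grad_target)
    return w
```

The "direct" form of Q-learning would drop the grad_target term; keeping it is what makes the update a true gradient of a fixed objective and hence convergent, at the cost of slower learning that, as the abstract notes, residual algorithms address by interpolating between the two updates.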