Evolutionary Search, Stochastic Policies with Memory, and Reinforcement Learning with Hidden State

Reinforcement learning (RL) problems with hidden state present significant obstacles to prevailing RL methods. In this paper, we present experiments with a straightforward approach to solving such problems: using evolutionary search to train artificial neural networks with recurrent connections to represent action policies. We apply this method to two benchmark problems. The first problem, involving maze navigation, has novel features. Key among these are that it is scalable, and thus provides a benchmark for investigating the performance of algorithms on problems with increasingly large state spaces, and that it facilitates the study of inter-task transfer of search effort by providing for the generation of multiple, related tasks. The second problem, New York Driving, was introduced by McCallum (1995); previously reported results for this task provide a basis for comparison with the evolutionary approach. Singh et al. (1994) demonstrated that in RL problems with hidden state, the best memoryless policy may be a stochastic one. Of particular interest in this study is our finding that, in practice, the ability to represent stochastic policies can significantly enhance the performance of evolutionary search for policies with memory. We explore this phenomenon via the use of recurrent networks composed of stochastic units as a means of representing policies.
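
To make the representation concrete, below is a minimal sketch of the kind of policy and search procedure described above. It is not the authors' implementation: the SignalMemoryTask environment, the network sizes, and the simple elitist, mutation-only search are illustrative assumptions, standing in for the maze-navigation and New York Driving tasks and for whatever evolutionary operators the paper actually used. The policy is a recurrent network whose hidden units are stochastic binary (Boltzmann-style) units and whose action is sampled from a softmax over the output units; fitness is the average episode return.

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, HID_DIM, N_ACTIONS = 3, 4, 2


class SignalMemoryTask:
    """Hypothetical stand-in task with hidden state: a binary cue is
    visible only on the first step, and the agent must repeat it after
    a short delay during which observations carry no information."""

    def __init__(self, delay=3):
        self.delay = delay

    def reset(self):
        self.signal = int(rng.integers(2))
        self.t = 0
        return np.array([1.0, float(self.signal), 0.0])  # cue shown once

    def step(self, action):
        self.t += 1
        if self.t < self.delay:
            return np.zeros(OBS_DIM), 0.0, False          # blank corridor
        reward = 1.0 if action == self.signal else -1.0
        return np.zeros(OBS_DIM), reward, True


def init_genome():
    """A genome is a flat vector holding all recurrent-network weights."""
    n = HID_DIM * (OBS_DIM + HID_DIM + 1) + N_ACTIONS * (HID_DIM + 1)
    return rng.normal(0.0, 0.5, n)


def act(genome, obs, hidden):
    """One step of the recurrent stochastic policy: binary hidden units
    are sampled from sigmoid firing probabilities, and the action is
    sampled from a softmax over the output units."""
    split = HID_DIM * (OBS_DIM + HID_DIM + 1)
    w_h = genome[:split].reshape(HID_DIM, OBS_DIM + HID_DIM + 1)
    w_a = genome[split:].reshape(N_ACTIONS, HID_DIM + 1)
    x = np.concatenate([obs, hidden, [1.0]])              # input, recurrent state, bias
    p_fire = 1.0 / (1.0 + np.exp(-w_h @ x))
    new_hidden = (rng.random(HID_DIM) < p_fire).astype(float)
    logits = w_a @ np.concatenate([new_hidden, [1.0]])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(N_ACTIONS, p=probs)), new_hidden


def episode_return(genome, env, horizon=50):
    obs, hidden, total = env.reset(), np.zeros(HID_DIM), 0.0
    for _ in range(horizon):
        action, hidden = act(genome, obs, hidden)
        obs, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total


def evolve(env, pop_size=50, generations=100, elite_frac=0.2, sigma=0.1):
    """Plain elitist evolutionary search over the network weights."""
    pop = [init_genome() for _ in range(pop_size)]
    n_elite = max(1, int(elite_frac * pop_size))
    for _ in range(generations):
        fitness = [np.mean([episode_return(g, env) for _ in range(5)]) for g in pop]
        elites = [pop[i] for i in np.argsort(fitness)[::-1][:n_elite]]
        # Refill the population with Gaussian-mutated copies of the elites.
        pop = elites + [elites[rng.integers(n_elite)] + rng.normal(0.0, sigma, elites[0].size)
                        for _ in range(pop_size - n_elite)]
    return elites[0]


if __name__ == "__main__":
    best = evolve(SignalMemoryTask())
    score = np.mean([episode_return(best, SignalMemoryTask()) for _ in range(100)])
    print(f"mean return of best evolved policy: {score:.2f}")
```

Because both the hidden units and the action selection are sampled, a single genome encodes a stochastic policy with memory, which is the representational point the abstract emphasizes: the search operates over weights, while the resulting behavior remains probabilistic.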

[1] Lawrence J. Fogel, et al. Artificial Intelligence through Simulated Evolution, 1966.

[2] Terrence J. Sejnowski, et al. A Learning Algorithm for Boltzmann Machines, 1985, Cognitive Science.

[3] Jeffrey L. Elman, et al. Finding Structure in Time, 1990, Cognitive Science.

[4] Tom M. Mitchell, et al. Reinforcement Learning with Hidden States, 1993.

[5] Astro Teller, et al. The Evolution of Mental Models, 1994.

[6] Michael I. Jordan, et al. Learning Without State-Estimation in Partially Observable Markovian Decision Processes, 1994, ICML.

[7] Peter J. Angeline, et al. An Evolutionary Algorithm That Constructs Recurrent Neural Networks, 1994, IEEE Transactions on Neural Networks.

[8] J. K. Kinnear, et al. Advances in Genetic Programming, 1994.

[9] Leslie Pack Kaelbling, et al. Learning Policies for Partially Observable Environments: Scaling Up, 1997, ICML.

[10] Andrew W. Moore, et al. Reinforcement Learning: A Survey, 1996, Journal of Artificial Intelligence Research.

[11] Andrew McCallum, et al. Learning to Use Selective Attention and Short-Term Memory in Sequential Tasks, 1996.

[12] Andrew McCallum, et al. Reinforcement Learning with Selective Perception and Hidden State, 1996.

[13] Andrew McCallum. Efficient Exploration in Reinforcement Learning with Hidden State, 1997.

[14] Jürgen Schmidhuber, et al. Reinforcement Learning with Self-Modifying Policies, 1998, Learning to Learn.

[15] John Loch, et al. Using Eligibility Traces to Find the Best Memoryless Policy in Partially Observable Markov Decision Processes, 1998, ICML.

[16] Andrew W. Moore, et al. Gradient Descent for General Reinforcement Learning, 1998, NIPS.

[17] Katia P. Sycara, et al. Evolution of Goal-Directed Behavior from Limited Information in a Complex Environment, 1999, GECCO.

[18] John J. Grefenstette, et al. Evolutionary Algorithms for Reinforcement Learning, 1999, Journal of Artificial Intelligence Research.