### Reinforcement Learning with Self-Modifying Policies

A learner's modifiable components are called its policy. An algorithm that modifies the policy is a learning algorithm. If the learning algorithm's own modifiable components are represented as part of the policy, then we speak of a self-modifying policy (SMP). SMPs can modify the way they modify themselves, and so on. They are of interest in situations where the initial learning algorithm itself can be improved by experience - this is what we call "learning to learn". How can we force a (stochastic) SMP to trigger better and better self-modifications? The success-story algorithm (SSA) addresses this question in a lifelong reinforcement learning context. During the learner's lifetime, SSA is occasionally invoked at times computed according to the SMP itself. SSA uses backtracking to undo those SMP-generated SMP-modifications that have not been empirically observed to trigger lifelong reward accelerations (measured up to the current SSA call; this evaluates the long-term effects of SMP-modifications that set the stage for later SMP-modifications). SMP-modifications that survive SSA represent a lifelong success history; until the next SSA call, they form the basis for additional SMP-modifications. Solely by self-modifications, our SMP/SSA-based learners solve a complex task in a partially observable environment (POE) whose state space is far larger than most reported in the POE literature.
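The backtracking step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names (`Checkpoint`, `ssa_call`, `reward_rate`) and the dictionary-valued policy are assumptions made for clarity. The key idea is the success-story criterion: each still-valid self-modification must have been followed by a strictly higher average reward per time unit than the modification before it (and the oldest one must beat the lifelong average); modifications violating this are undone, restoring the policy saved before they were made.

```python
# Hedged sketch of the success-story algorithm's (SSA) backtracking step.
# All names here are illustrative assumptions, not taken from the paper.

from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    time: float                      # time when the self-modification occurred
    reward: float                    # cumulative reward at that time
    saved_policy: dict = field(default_factory=dict)  # backup for undoing it

def reward_rate(now: float, total_reward: float, cp: Checkpoint) -> float:
    """Average reward per time unit earned since checkpoint cp."""
    return (total_reward - cp.reward) / (now - cp.time)

def ssa_call(stack: list, now: float, total_reward: float, policy: dict) -> dict:
    """Pop (undo) recent self-modifications until each surviving one was
    followed by faster average reward than its predecessor (the
    'success-story criterion'). Returns the possibly restored policy."""
    while stack:
        if len(stack) >= 2:
            ok = (reward_rate(now, total_reward, stack[-1])
                  > reward_rate(now, total_reward, stack[-2]))
        else:
            # Oldest surviving modification must beat the lifelong average.
            ok = reward_rate(now, total_reward, stack[-1]) > total_reward / now
        if ok:
            break                      # success story holds; stop undoing
        policy = stack.pop().saved_policy  # undo the latest modification
    return policy
```

On a toy run with two recorded modifications, the second of which was not followed by a reward acceleration, the SSA call pops only that second checkpoint and restores the policy saved before it, while the first modification survives as part of the success history.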
