Simple Principles of Metalearning

The goal of metalearning is to generate useful shifts of inductive bias by adapting the current learning strategy in a ``useful'' way. Our learner leads a single life during which actions are continually executed according to the system's internal state and current {\em policy} (a modifiable, probabilistic algorithm mapping environmental inputs and internal states to outputs and new internal states). An action is considered a learning algorithm if it can modify the policy. Effects of learning processes on later learning processes are measured using reward/time ratios. Occasional backtracking enforces success histories of still valid policy modifications corresponding to histories of lifelong reward accelerations. The principle allows for plugging in a wide variety of learning algorithms. In particular, it allows for embedding the learner's policy modification strategy within the policy itself (self-reference). To demonstrate the principle's feasibility in cases where conventional reinforcement learning fails, we test it in complex, non-Markovian, changing environments (``POMDPs''). One of the tasks involves more than $10^{13}$ states, two learners that both cooperate and compete, and strongly delayed reinforcement signals (initially separated by more than 300,000 time steps).
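The backtracking idea above can be made concrete with a small sketch: at an evaluation point, each still valid policy modification must have been followed by a higher average reward per time step than was observed since the previous valid modification (or since the start of life); otherwise it is undone. The sketch below is only an illustration of that reward/time acceleration check, not the authors' implementation, and all names in it (\texttt{Checkpoint}, \texttt{ssa\_backtrack}, \texttt{restore}) are hypothetical.

\begin{verbatim}
# Hedged sketch of a success-history check via backtracking.
# Assumptions: each policy modification is recorded as a Checkpoint
# holding the lifetime and cumulative reward at which it was made,
# plus a callable that undoes it.  All names are illustrative.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Checkpoint:
    time: float                    # lifetime when the modification was made
    reward: float                  # cumulative lifelong reward at that moment
    restore: Callable[[], None]    # undoes this policy modification


def ssa_backtrack(stack: List[Checkpoint],
                  now: float,
                  total_reward: float) -> None:
    """Pop (and undo) modifications violating the reward/time criterion.

    Afterwards, each surviving modification was followed by a strictly
    higher reward/time ratio than the one before it, i.e. the remaining
    history corresponds to a lifelong reward acceleration.
    """
    while stack:
        last = stack[-1]
        rate_since_last = (total_reward - last.reward) / max(now - last.time, 1e-9)
        if len(stack) == 1:
            # Compare against the reward rate over the whole life so far.
            baseline = total_reward / max(now, 1e-9)
        else:
            prev = stack[-2]
            baseline = (total_reward - prev.reward) / max(now - prev.time, 1e-9)
        if rate_since_last > baseline:
            break                    # success history still holds; stop popping
        stack.pop().restore()        # undo the unsuccessful modification
\end{verbatim}

A stack is the natural structure here: modifications are undone in reverse chronological order, so the surviving prefix always forms a chain of reward/time ratios that increases toward the present.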
