Explorations in efficient reinforcement learning

This thesis describes reinforcement learning (RL) methods that solve sequential decision-making problems by learning from trial and error. In a sequential decision-making problem, an artificial agent interacts with an environment through its sensors (to receive inputs) and effectors (to perform actions). To measure the quality of the agent's behavior, a reward function determines how much the agent is rewarded or penalized for performing particular actions in particular environmental states. The goal is to find an action-selection policy for the agent that maximizes the cumulative reward collected in the future. In RL, an agent's policy maps sensor-based inputs to actions. To evaluate a policy, a value function is learned that returns, for each possible state, the future cumulative reward collected by following the current policy. Given such a value function, the agent can simply select the action with the largest estimated value. To learn a value function for a specific problem, RL methods simulate a policy and learn from the resulting experiences, which consist of (state, action, reward, next-state) quadruples. There are different classes of RL problems and different RL methods for solving them; we describe several such classes and introduce new methods for solving them.

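To make this concrete, the following is a minimal sketch of how one well-known RL method, tabular Q-learning, updates a value function from (state, action, reward, next-state) quadruples gathered by simulating a policy. The corridor environment, the constants, and the `step` helper are hypothetical illustrations, not part of the thesis; they merely show the general learning loop described above.

```python
import random

# Hypothetical toy problem: a 1-D corridor of N states; the agent moves
# left or right and is rewarded only on reaching the rightmost state.
N_STATES = 6
ACTIONS = [0, 1]          # 0 = move left, 1 = move right
ALPHA = 0.1               # learning rate
GAMMA = 0.9               # discount factor for future rewards
EPSILON = 0.1             # exploration probability

# Tabular value function: Q[state][action] estimates future cumulative reward.
Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]

def step(state, action):
    """Assumed environment dynamics: return (reward, next_state)."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return reward, next_state

for episode in range(500):
    state = 0
    while state != N_STATES - 1:
        # Epsilon-greedy policy: usually select the action with the largest
        # estimated value, occasionally explore at random.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[state][a])
        reward, next_state = step(state, action)
        # Update the value estimate from one (s, a, r, s') quadruple.
        target = reward + GAMMA * max(Q[next_state])
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = next_state
```

After training, reading out `max(Q[s])` for each state `s` gives the learned value function, and acting greedily with respect to it yields the improved policy.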