Exploiting generalization in the subspaces for faster model-based learning

Due to insufficient generalization across the state-space, common Reinforcement Learning (RL) methods learn slowly, especially in the early learning trials. This paper introduces a model-based method for discrete state-spaces that increases learning speed in terms of required experience (though not required computational time) by exploiting generalization over the experiences collected in subspaces. A subspace is formed by choosing a subset of the features in the original state representation (the full-space). Generalization, and hence faster learning, in a subspace arises from the many-to-one mapping of experiences from the full-space to each subspace state. However, because of the perceptual aliasing inherent in the subspaces, the policy suggested by a subspace does not in general converge to the optimal policy. Our approach, called Model Based Learning with Subspaces (MoBLeS), computes confidence intervals for the estimated Q-values in the full-space and in the subspaces. These confidence intervals are used in decision making so that the agent benefits as much as possible from the generalization while avoiding the detrimental effects of perceptual aliasing in the subspaces. Convergence of MoBLeS to the optimal policy is investigated theoretically. Additionally, several experiments show that MoBLeS improves learning speed in the early trials.
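To make the core idea concrete, below is a minimal sketch, not the authors' MoBLeS implementation, of how Q-value estimates with Hoeffding-style confidence intervals could be maintained in both the full-space and one feature subspace, and how the subspace estimate might be preferred only while its interval is consistent with the full-space one. The class names, the projection, the interval rule, and all parameters are illustrative assumptions rather than details taken from the paper.

```python
# Illustrative sketch only: maintains Q estimates and Hoeffding-style confidence
# intervals in the full-space and in one subspace, and resolves between them by
# an interval-overlap test. Not the authors' algorithm.
import math
import random
from collections import defaultdict


class QEstimate:
    """Running mean of bootstrapped targets with a Hoeffding-style confidence radius."""

    def __init__(self, v_max=1.0, delta=0.05):
        self.n = 0
        self.mean = 0.0
        self.v_max = v_max    # assumed bound on |target|, needed for the Hoeffding radius
        self.delta = delta    # confidence level (assumption)

    def update(self, target):
        self.n += 1
        self.mean += (target - self.mean) / self.n

    def interval(self):
        if self.n == 0:
            return (-self.v_max, self.v_max)  # uninformative before any data
        radius = self.v_max * math.sqrt(math.log(2.0 / self.delta) / (2.0 * self.n))
        return (self.mean - radius, self.mean + radius)


class SubspaceAgent:
    """Agent that aggregates experience in the full-space and in one feature subspace."""

    def __init__(self, actions, subspace_features, gamma=0.95):
        self.actions = actions
        self.subspace_features = subspace_features  # indices of the features kept
        self.gamma = gamma
        self.q_full = defaultdict(lambda: defaultdict(QEstimate))
        self.q_sub = defaultdict(lambda: defaultdict(QEstimate))

    def project(self, state):
        # Many-to-one mapping: drop the features outside the subspace.
        return tuple(state[i] for i in self.subspace_features)

    def update(self, state, action, reward, next_state):
        # One-step bootstrapped target, shared by both estimators.
        next_best = max((self.q_full[next_state][a].mean for a in self.actions),
                        default=0.0)
        target = reward + self.gamma * next_best
        self.q_full[state][action].update(target)
        self.q_sub[self.project(state)][action].update(target)

    def value_for(self, state, action):
        lo_f, hi_f = self.q_full[state][action].interval()
        lo_s, hi_s = self.q_sub[self.project(state)][action].interval()
        # Use the faster-converging subspace estimate only while its interval still
        # overlaps the full-space interval; otherwise fall back to the full-space
        # estimate, which is unaffected by perceptual aliasing.
        if lo_s <= hi_f and lo_f <= hi_s:
            return self.q_sub[self.project(state)][action].mean
        return self.q_full[state][action].mean

    def act(self, state, epsilon=0.1):
        if random.random() < epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.value_for(state, a))
```

Because the subspace pools experience from every full-space state that maps onto the same projected state, its intervals shrink faster early on; once the full-space estimates become precise enough to contradict an aliased subspace estimate, the overlap test makes the agent revert to the full-space values.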
