Inverse Reinforcement Learning via Nonparametric Spatio-Temporal Subgoal Modeling

Recent advances in inverse reinforcement learning (IRL) have yielded sophisticated frameworks that relax the original modeling assumption that the behavior of an observed agent reflects only a single intention. Instead, the demonstration data are typically divided into parts to account for the fact that different trajectories may correspond to different intentions, e.g., because they were generated by different domain experts. In this work, we go one step further: using the intuitive concept of subgoals, we build on the premise that even a single trajectory can be explained more efficiently by a sequence of local, context-dependent objectives than by a single global one, enabling a more compact representation of the observed behavior. Based on this assumption, we build an implicit intentional model of the agent's goals to forecast its behavior in unobserved situations. The result is an integrated Bayesian prediction framework that provides smooth policy estimates which are consistent with the expert's plan and significantly outperform existing IRL solutions. Most notably, our framework naturally handles situations where the agent's intentions change over time, a setting in which classical IRL algorithms fail. In addition, due to its probabilistic nature, the model can be applied directly in an active learning setting to guide the expert's demonstration process.
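To make the subgoal idea concrete, the following is a minimal, self-contained sketch of how subgoal-based behavior prediction can be set up in a toy gridworld. It is not the authors' implementation: each time step of a demonstration is assigned a latent subgoal, assignments are inferred by Gibbs sampling under a simple "sticky" temporal prior (a crude stand-in for the paper's nonparametric spatio-temporal prior), and the policy at any state is predicted as a Boltzmann policy toward the sampled subgoals. All names, constants, and modeling choices below (grid size, rationality parameter, stickiness, Manhattan-distance surrogate value) are illustrative assumptions.

```python
# Illustrative sketch only; not the paper's actual model or code.
import numpy as np

rng = np.random.default_rng(0)

SIZE = 5                                      # 5x5 gridworld (assumed)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
BETA = 3.0                                    # Boltzmann rationality parameter (assumed)
STICKY = 0.9                                  # prob. of keeping the previous subgoal (assumed)
STATES = [(r, c) for r in range(SIZE) for c in range(SIZE)]

def step(state, action):
    """Deterministic grid dynamics with wall clipping."""
    r = min(max(state[0] + action[0], 0), SIZE - 1)
    c = min(max(state[1] + action[1], 0), SIZE - 1)
    return (r, c)

def action_loglik(state, action, subgoal):
    """Log-likelihood of an action under a Boltzmann policy that greedily
    reduces the Manhattan distance to the subgoal (surrogate value function)."""
    def value(s):
        return -(abs(s[0] - subgoal[0]) + abs(s[1] - subgoal[1]))
    scores = BETA * np.array([value(step(state, a)) for a in ACTIONS])
    scores -= scores.max()
    logp = scores - np.log(np.exp(scores).sum())
    return logp[ACTIONS.index(action)]

def gibbs_subgoals(traj, n_iter=200):
    """Sample per-step subgoal assignments for one (state, action) trajectory."""
    T = len(traj)
    z = [STATES[rng.integers(len(STATES))] for _ in range(T)]  # random init
    samples = []
    for it in range(n_iter):
        for t in range(T):
            s, a = traj[t]
            logp = np.empty(len(STATES))
            for k, g in enumerate(STATES):
                lp = action_loglik(s, a, g)
                # sticky temporal prior: favor agreement with neighboring steps
                if t > 0:
                    lp += np.log(STICKY if g == z[t - 1] else (1 - STICKY) / (len(STATES) - 1))
                if t < T - 1:
                    lp += np.log(STICKY if g == z[t + 1] else (1 - STICKY) / (len(STATES) - 1))
                logp[k] = lp
            p = np.exp(logp - logp.max())
            z[t] = STATES[rng.choice(len(STATES), p=p / p.sum())]
        if it >= n_iter // 2:                  # keep post-burn-in samples
            samples.append(list(z))
    return samples

def predict_policy(state, samples):
    """Posterior-predictive action distribution at a (possibly unobserved) state,
    averaging Boltzmann policies over the sampled subgoals of the last time step."""
    probs = np.zeros(len(ACTIONS))
    for z in samples:
        g = z[-1]                              # assume the final subgoal is still active
        logp = np.array([action_loglik(state, a, g) for a in ACTIONS])
        probs += np.exp(logp)
    return probs / probs.sum()

# Toy demonstration: the expert first moves to the corner (0, 4), then heads to (4, 4),
# i.e., its intention changes mid-trajectory.
traj = [((2, 2), (0, 1)), ((2, 3), (-1, 0)), ((1, 3), (0, 1)), ((1, 4), (-1, 0)),
        ((0, 4), (1, 0)), ((1, 4), (1, 0)), ((2, 4), (1, 0)), ((3, 4), (1, 0))]
samples = gibbs_subgoals(traj)
print("Predicted action distribution at (3, 3):", predict_policy((3, 3), samples))
```

In this toy setup, the sampler tends to assign the early steps to a subgoal near (0, 4) and the later steps to one near (4, 4), so the predicted policy at an unobserved state reflects the most recently active intention rather than a single global reward, which is the behavior the abstract describes for time-varying intentions.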
