AGIL: Learning Attention from Human for Visuomotor Tasks

When intelligent agents learn visuomotor behaviors from human demonstrations, they may benefit from knowing where the human is allocating visual attention, which can be inferred from their gaze. A wealth of information regarding intelligent decision making is conveyed by human gaze allocation; hence, exploiting such information has the potential to improve the agents’ performance. With this motivation, we propose the AGIL (Attention Guided Imitation Learning) framework. We collect high-quality human action and gaze data while playing Atari games in a carefully controlled experimental setting. Using these data, we first train a deep neural network that can predict human gaze positions and visual attention with high accuracy (the gaze network) and then train another network to predict human actions (the policy network). Incorporating the learned attention model from the gaze network into the policy network significantly improves the action prediction accuracy and task performance.

[1]  Frédo Durand,et al.  What Do Different Evaluation Metrics Tell Us About Saliency Models? , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Yizhou Yu,et al.  Visual saliency based on multiscale deep features , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Tom Schaul,et al.  Dueling Network Architectures for Deep Reinforcement Learning , 2015, ICML.

[4]  Pieter R. Roelfsema,et al.  Attention-Gated Reinforcement Learning of Internal Representations for Classification , 2005, Neural Computation.

[5]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[6]  Nathalie Guyader,et al.  Modelling Spatio-Temporal Saliency to Predict Gaze Direction for Short Videos , 2009, International Journal of Computer Vision.

[7]  Marc G. Bellemare,et al.  The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract) , 2012, IJCAI.

[8]  Wojciech Matusik,et al.  Eye Tracking for Everyone , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Clay B. Holroyd,et al.  The neural basis of human error processing: reinforcement learning, dopamine, and the error-related negativity. , 2002, Psychological review.

[10]  Mary Hayhoe,et al.  Predicting human visuomotor behaviour in a driving task , 2014, Philosophical Transactions of the Royal Society B: Biological Sciences.

[11]  Xiaogang Wang,et al.  Saliency detection by multi-context deep learning , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Frédo Durand,et al.  A Benchmark of Computational Models of Saliency to Predict Human Fixations , 2012 .

[13]  Tom Schaul,et al.  Deep Q-learning From Demonstrations , 2017, AAAI.

[14]  Andrea Palazzi,et al.  Predicting the Driver's Focus of Attention: The DR(eye)VE Project , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Yuan Chang Leong,et al.  Dynamic Interaction between Reinforcement Learning and Attention in Multidimensional Environments , 2017, Neuron.

[16]  Wilson S. Geisler,et al.  Gaze-contingent real-time simulation of arbitrary visual fields , 2002, IS&T/SPIE Electronic Imaging.

[17]  Nicolas Riche,et al.  Saliency and Human Fixations: State-of-the-Art and Study of Comparison Metrics , 2013, 2013 IEEE International Conference on Computer Vision.

[18]  Qi Zhao,et al.  SALICON: Saliency in Context , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Heiner Deubel,et al.  Deployment of visual attention before sequences of goal-directed hand movements , 2006, Vision Research.

[20]  Mary Hayhoe,et al.  Saccades to future ball location reveal memory-based prediction in a virtual-reality interception task. , 2013, Journal of vision.

[21]  Matthew E. Taylor,et al.  Pre-training Neural Networks with Human Demonstrations for Deep Reinforcement Learning , 2017, ArXiv.

[22]  Demis Hassabis,et al.  Mastering the game of Go with deep neural networks and tree search , 2016, Nature.

[23]  Sergey Levine,et al.  End-to-End Training of Deep Visuomotor Policies , 2015, J. Mach. Learn. Res..

[24]  Gunnar Farnebäck,et al.  Two-Frame Motion Estimation Based on Polynomial Expansion , 2003, SCIA.

[25]  R. Venkatesh Babu,et al.  DeepFix: A Fully Convolutional Neural Network for Predicting Human Eye Fixations , 2015, IEEE Transactions on Image Processing.

[26]  Michael S. Bernstein,et al.  Visual7W: Grounded Question Answering in Images , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Demis Hassabis,et al.  Mastering the game of Go without human knowledge , 2017, Nature.

[28]  Alex Graves,et al.  DRAW: A Recurrent Neural Network For Image Generation , 2015, ICML.

[29]  Mary M Hayhoe,et al.  Task and context determine where you look. , 2016, Journal of vision.

[30]  Frédo Durand,et al.  Where Should Saliency Models Look Next? , 2016, ECCV.

[31]  Jonathan D. Cohen,et al.  The effects of neural gain on attention and learning , 2013, Nature Neuroscience.

[32]  Tom Schaul,et al.  Rainbow: Combining Improvements in Deep Reinforcement Learning , 2017, AAAI.

[33]  Debasish Biswas,et al.  Visualization of unsteady viscous flow around turbine blade , 2008, J. Vis..

[34]  Marcin Andrychowicz,et al.  Overcoming Exploration in Reinforcement Learning with Demonstrations , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[35]  Laurent Itti,et al.  Beyond bottom-up: Incorporating task-dependent influences into a computational model of spatial attention , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  D. Ballard,et al.  Eye guidance in natural vision: reinterpreting salience. , 2011, Journal of vision.

[37]  Brett Browning,et al.  A survey of robot learning from demonstration , 2009, Robotics Auton. Syst..

[38]  Alex Graves,et al.  Recurrent Models of Visual Attention , 2014, NIPS.

[39]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[40]  Philip Bachman,et al.  Deep Reinforcement Learning that Matters , 2017, AAAI.

[41]  Christof Koch,et al.  A Model of Saliency-Based Visual Attention for Rapid Scene Analysis , 2009 .

[42]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[43]  D. Ballard,et al.  Eye movements in natural behavior , 2005, Trends in Cognitive Sciences.

[44]  Nasser Mozayani,et al.  Learning to predict where to look in interactive environments using deep recurrent q-learning , 2016, ArXiv.