Learning where to search using visual attention

One of the central tasks for a household robot is searching for specific objects. This requires not only localizing the target object but also identifying promising search locations in the scene when the target is not immediately visible. Because computation time and hardware resources are usually limited in robotics, it is desirable to avoid expensive visual processing steps applied exhaustively over the entire image. The human visual system can quickly select the image locations that must be processed in detail for a given task, which allows us to cope with huge amounts of information and to deploy the limited capacities of our visual system efficiently. In this paper, we therefore propose to use human fixation data to train a top-down saliency model that predicts relevant image locations when searching for specific objects. We show that the learned model can successfully prune bounding box proposals without rejecting the ground truth object locations. In this respect, the proposed model outperforms a model trained only on ground truth segmentations of the target object rather than on fixation data.
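
To make the idea of pruning bounding box proposals with a saliency map concrete, the following is a minimal sketch and not the method of the paper. The abstract does not state the pruning criterion, so the rule used here (keep a box if its mean saliency reaches a threshold) is an assumption, as are the function name `prune_proposals`, the `threshold` value, and the `(x0, y0, x1, y1)` box layout.

```python
import numpy as np


def prune_proposals(saliency, boxes, threshold=0.2):
    """Keep boxes whose mean saliency is at least `threshold`.

    saliency: 2-D array (H, W) with values in [0, 1], standing in for the
              output of a top-down saliency model.
    boxes:    iterable of (x0, y0, x1, y1) pixel coordinates, with x1 and y1
              exclusive (hypothetical layout chosen for this sketch).
    """
    kept = []
    for x0, y0, x1, y1 in boxes:
        region = saliency[y0:y1, x0:x1]
        # Assumed criterion: average saliency inside the box.
        if region.size > 0 and region.mean() >= threshold:
            kept.append((x0, y0, x1, y1))
    return kept


# Toy saliency map with a single salient rectangle.
saliency = np.zeros((100, 100))
saliency[20:40, 30:60] = 1.0

proposals = [
    (30, 20, 60, 40),  # lies exactly on the salient region
    (0, 60, 20, 90),   # lies entirely in a non-salient area
    (25, 15, 65, 45),  # loosely encloses the salient region
]
kept = prune_proposals(saliency, proposals)
```

In this toy case the second proposal is discarded and the other two are kept, so an expensive detector would only need to run on the surviving boxes. A real system would have to choose the threshold so that ground truth object locations are not rejected.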
