A Neural Autoregressive Approach to Attention-based Recognition

Tasks that require the synchronization of perception and action are incredibly hard and pose a fundamental challenge to the fields of machine learning and computer vision. One important example of such a task is the problem of performing visual recognition through a sequence of controllable fixations; this requires jointly deciding what inference to perform from fixations and where to perform these fixations. While these two problems are challenging when addressed separately, they become even more formidable if solved jointly. Recently, a restricted Boltzmann machine (RBM) model was proposed that could learn meaningful fixation policies and achieve good recognition performance. In this paper, we propose an alternative approach based on a feed-forward, auto-regressive architecture, which permits exact calculation of training gradients (given the fixation sequence), unlike for the RBM model. On a problem of facial expression recognition, we demonstrate the improvement gained by this alternative approach. Additionally, we investigate several variations of the model in order to shed some light on successful strategies for fixation-based recognition.

[1]  L. T. Troland Helmholtz's Treatise on Physiological Optics , 1926 .

[2]  H. Helmholtz Helmholtz's Treatise on Physiological Optics , 1963 .

[3]  Jürgen Schmidhuber,et al.  Learning to Generate Artificial Fovea Trajectories for Target Detection , 1991, Int. J. Neural Syst..

[4]  Geoffrey E. Hinton Training Products of Experts by Minimizing Contrastive Divergence , 2002, Neural Computation.

[5]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[6]  W. Geisler,et al.  Optimal Eye Movement Strategies in Visual Search ( Supplement ) , 2005 .

[7]  H. Basford,et al.  Optimal eye movement strategies in visual search , 2005 .

[8]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[9]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[10]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[11]  Yoshua Bengio,et al.  Classification using discriminative restricted Boltzmann machines , 2008, ICML '08.

[12]  S. Grossberg,et al.  View-invariant object category learning, recognition, and search: How spatial and object attention are coordinated using surface-based attentional shrouds , 2009, Cognitive Psychology.

[13]  Christof Koch,et al.  A Model of Saliency-Based Visual Attention for Rapid Scene Analysis , 2009 .

[14]  Frédo Durand,et al.  Learning to predict where humans look , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[15]  Yihong Gong,et al.  Linear spatial pyramid matching using sparse coding for image classification , 2009, CVPR.

[16]  Geoffrey E. Hinton,et al.  Learning to combine foveal glimpses with a third-order Boltzmann machine , 2010, NIPS.

[17]  Garrison W. Cottrell,et al.  Robust classification of objects, faces, and flowers using natural image statistics , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[18]  Javier R. Movellan,et al.  Infomax Control of Eye Movements , 2010, IEEE Transactions on Autonomous Mental Development.

[19]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[20]  William D. Smart,et al.  A POMDP Model of Eye-Hand Coordination , 2011, Proceedings of the AAAI Conference on Artificial Intelligence.

[21]  Pascal Vincent,et al.  Contractive Auto-Encoders: Explicit Invariance During Feature Extraction , 2011, ICML.

[22]  Hugo Larochelle,et al.  The Neural Autoregressive Distribution Estimator , 2011, AISTATS.

[23]  Andreas Krause,et al.  Contextual Gaussian Process Bandit Optimization , 2011, NIPS.

[24]  Shi-Min Hu,et al.  Global contrast based salient region detection , 2011, CVPR 2011.

[25]  Nando de Freitas,et al.  Learning attentional policies for tracking and recognition in video with deep networks , 2011, ICML.

[26]  Nitish Srivastava,et al.  Improving neural networks by preventing co-adaptation of feature detectors , 2012, ArXiv.

[27]  Misha Denil,et al.  Learning Where to Attend with Deep Architectures for Image Tracking , 2011, Neural Computation.

[28]  Yael Pritch,et al.  Saliency filters: Contrast based filtering for salient region detection , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Hugo Larochelle,et al.  A Neural Autoregressive Topic Model , 2012, NIPS.

[30]  Hugo Larochelle,et al.  RNADE: The real-valued neural autoregressive density-estimator , 2013, NIPS.

[31]  Cristian Sminchisescu,et al.  Action from Still Image Dataset and Inverse Optimal Control to Learn Task Specific Visual Scanpaths , 2013, NIPS.