Learning Where to Attend with Deep Architectures for Image Tracking

We discuss an attentional model for simultaneous object tracking and recognition that is driven by gaze data. Motivated by theories of perception, the model consists of two interacting pathways, identity and control, intended to mirror the what and where pathways in neuroscience models. The identity pathway models object appearance and performs classification using deep (factored) restricted Boltzmann machines. At each point in time, the observations consist of foveated images, with resolution decaying toward the periphery of the gaze. The control pathway models the location, orientation, scale, and speed of the attended object. The posterior distribution of these states is estimated with particle filtering. Deeper in the control pathway is an attentional mechanism that learns to select gazes so as to minimize tracking uncertainty. Unlike in our previous work, we introduce gaze selection strategies that operate in the presence of partial information and on a continuous action space. We show that a straightforward extension of the existing approach to the partial information setting results in poor performance, and we propose an alternative method based on modeling the reward surface as a Gaussian process. This approach gives good performance in the presence of partial information and allows us to expand the action space from a small, discrete set of fixation points to a continuous domain.
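The idea of modeling the reward surface as a Gaussian process can be illustrated with a minimal sketch. This is not the authors' implementation: the squared-exponential kernel, the upper-confidence-bound selection rule, and the names `rbf_kernel`, `gp_posterior`, `select_gaze`, `length_scale`, `noise`, and `kappa` are all assumptions made for illustration. The sketch fits a zero-mean Gaussian process to rewards observed at previously visited gaze locations and then picks the next gaze from a set of candidate points sampled in a continuous 2-D domain.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.3):
    """Squared-exponential kernel between two sets of 2-D gaze locations."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-2):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_no = rbf_kernel(x_new, x_obs)
    mean = k_no @ np.linalg.solve(k_oo, y_obs)
    # The prior variance of this kernel is 1 at every point.
    var = 1.0 - np.sum(k_no * np.linalg.solve(k_oo, k_no.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def select_gaze(x_obs, y_obs, candidates, kappa=2.0):
    """Pick the candidate gaze with the highest upper-confidence-bound score.

    The UCB rule is an assumed acquisition function, chosen for illustration.
    """
    mean, std = gp_posterior(x_obs, y_obs, candidates)
    return candidates[np.argmax(mean + kappa * std)]

# Toy usage: two gazes tried so far, with hypothetical observed rewards.
x_obs = np.array([[0.0, 0.0], [1.0, 1.0]])
y_obs = np.array([0.2, 0.9])
rng = np.random.default_rng(0)
candidates = rng.uniform(0.0, 1.0, size=(200, 2))  # samples of the continuous domain
next_gaze = select_gaze(x_obs, y_obs, candidates)
```

Because the surrogate is defined over the whole domain, candidate gazes need not come from a fixed grid of fixation points; here they are simply drawn uniformly, and a continuous optimizer could be used in their place.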
