Learning Temporal Coherent Features through Life-Time Sparsity

In this paper, we consider the problem of unsupervised feature learning for spatio-temporal data streams, specifically video data. We focus on learning features that are invariant to image transformations, and regard a video stream as a set of pairwise similar images. Many existing methods for invariant feature extraction either build a model of the transformations present in the data or achieve invariance by adding a penalty to a reconstruction loss. In contrast, we propose to learn invariant features by directly optimizing the temporal coherence of a hidden, and possibly deep, representation. We find that our approach is both fast and capable of learning deep feature representations invariant to complex image transformations. We furthermore show that features learned using our approach can be used to improve object recognition performance in still images (Caltech-101, STL-10).
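To make the idea of optimizing temporal coherence concrete, the following is a minimal sketch of one possible coherence penalty, not the objective used in the paper. It assumes a single linear layer with a soft-absolute activation and measures the mean L1 distance between the normalized features of consecutive frames. The function names, the activation, the normalization, and the choice of L1 distance are all illustrative assumptions; the sketch also omits any sparsity term, so on its own it admits trivial constant solutions.

```python
import math

# Illustrative sketch only: the activation, normalization, and L1 distance
# are assumptions, not the objective described in the paper.

def features(W, x, eps=1e-8):
    """Soft-absolute response of each filter row w in W: sqrt((w . x)^2 + eps)."""
    return [math.sqrt(sum(w * v for w, v in zip(row, x)) ** 2 + eps) for row in W]

def l2_normalize(h):
    """Scale a feature vector to unit L2 norm (left unchanged if the norm is 0)."""
    norm = math.sqrt(sum(v * v for v in h)) or 1.0
    return [v / norm for v in h]

def temporal_coherence_penalty(W, frames):
    """Mean L1 distance between normalized features of consecutive frames.

    A low value means the representation changes little from frame to frame.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    feats = [l2_normalize(features(W, x)) for x in frames]
    pairs = list(zip(feats[:-1], feats[1:]))
    total = sum(sum(abs(a - b) for a, b in zip(f0, f1)) for f0, f1 in pairs)
    return total / len(pairs)

# Hypothetical toy data: two filters, three two-pixel "frames".
W = [[1.0, 0.0], [0.0, 1.0]]
static_clip = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
moving_clip = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
static_penalty = temporal_coherence_penalty(W, static_clip)
moving_penalty = temporal_coherence_penalty(W, moving_clip)
```

In this toy setup the static clip yields a penalty of zero, while the clip whose content jumps between pixels yields a strictly larger one. A learning procedure would adjust `W` to lower this penalty on real video, together with some additional constraint that prevents the features from collapsing to a constant.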
