Emergence of Complex-Like Cells in a Temporal Product Network with Local Receptive Fields

We introduce a new neural architecture and an unsupervised algorithm for learning invariant representations from temporal sequence of images. The system uses two groups of complex cells whose outputs are combined multiplicatively: one that represents the content of the image, constrained to be constant over several consecutive frames, and one that represents the precise location of features, which is allowed to vary over time but constrained to be sparse. The architecture uses an encoder to extract features, and a decoder to reconstruct the input from the features. The method was applied to patches extracted from consecutive movie frames and produces orientation and frequency selective units analogous to the complex cells in V1. An extension of the method is proposed to train a network composed of units with local receptive field spread over a large image of arbitrary size. A layer of complex cells, subject to sparsity constraints, pool feature units over overlapping local neighborhoods, which causes the feature units to organize themselves into pinwheel patterns of orientation-selective receptive fields, similar to those observed in the mammalian visual cortex. A feed-forward encoder efficiently computes the feature representation of full images.

[1]  K. Obermayer,et al.  Geometry of orientation and ocular dominance columns in monkey striate cortex , 1993, The Journal of neuroscience : the official journal of the Society for Neuroscience.

[2]  David J. Field,et al.  Emergence of simple-cell receptive field properties by learning a sparse code for natural images , 1996, Nature.

[3]  M. Stryker,et al.  Ocular dominance peaks at pinwheel center singularities of the orientation map in cat visual cortex. , 1997, Journal of neurophysiology.

[4]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[5]  Aapo Hyvärinen,et al.  A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images , 2001, Vision Research.

[6]  Geoffrey E. Hinton,et al.  Products of Hidden Markov Models , 2001, AISTATS.

[7]  Terrence J. Sejnowski,et al.  Slow Feature Analysis: Unsupervised Learning of Invariances , 2002, Neural Computation.

[8]  Pietro Perona,et al.  Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.

[9]  Laurenz Wiskott,et al.  Slow feature analysis yields a rich repertoire of complex cell properties. , 2005, Journal of vision.

[10]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[11]  Thomas Serre,et al.  A feedforward architecture accounts for rapid categorization , 2007, Proceedings of the National Academy of Sciences.

[12]  Bruno A. Olshausen,et al.  Learning Transformational Invariants from Natural Movies , 2008, NIPS.

[13]  Nicolas Pinto,et al.  Why is Real-World Visual Object Recognition Hard? , 2008, PLoS Comput. Biol..

[14]  Yoshua Bengio,et al.  Slow, Decorrelated Features for Pretraining Complex Cell-like Networks , 2009, NIPS.

[15]  Yihong Gong,et al.  Linear spatial pyramid matching using sparse coding for image classification , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Honglak Lee,et al.  Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations , 2009, ICML '09.

[17]  Yann LeCun,et al.  What is the best multi-stage architecture for object recognition? , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[18]  Geoffrey E. Hinton,et al.  Factored conditional restricted Boltzmann Machines for modeling motion style , 2009, ICML '09.

[19]  R. Fergus,et al.  Learning invariant features through topographic filter maps , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Richard E. Turner,et al.  A Structured Model of Video Reproduces Primary Visual Cortical Organisation , 2009, PLoS Comput. Biol..

[21]  Yihong Gong,et al.  Nonlinear Learning using Local Coordinate Coding , 2009, NIPS.

[22]  David D. Cox,et al.  A High-Throughput Screening Approach to Discovering Good Forms of Biologically Inspired Visual Representation , 2009, PLoS Comput. Biol..

[23]  Michael S. Lewicki,et al.  Emergence of complex cell properties by learning to generalize in natural scenes , 2009, Nature.

[24]  Jean Ponce,et al.  Learning mid-level features for recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[25]  Marc'Aurelio Ranzato,et al.  Fast Inference in Sparse Coding Algorithms with Applications to Object Recognition , 2010, ArXiv.

[26]  Niko Wilbert,et al.  Slow feature analysis , 2011, Scholarpedia.