Invariant recognition drives neural representations of action sequences

Recognizing the actions of others from visual stimuli is a crucial aspect of human perception that allows individuals to respond to social cues. Humans are able to discriminate between similar actions despite transformations, such as changes in viewpoint or actor, that substantially alter the visual appearance of a scene. This ability to generalize across complex transformations is a hallmark of human visual intelligence. Advances in understanding action recognition at the neural level have not always translated into precise accounts of the computational principles that determine which representations of action sequences human visual cortex constructs. Here we test the hypothesis that invariant action discrimination might fill this gap. Recently, the study of artificial systems for static object perception has produced models, Convolutional Neural Networks (CNNs), that achieve human-level performance in complex discriminative tasks. Within this class, architectures that better support invariant object recognition also produce image representations that better match those implied by human and primate neural data. However, whether these models produce representations of action sequences that support recognition across complex transformations and closely follow neural representations of actions remains unknown. Here we show that spatiotemporal CNNs accurately categorize video stimuli into action classes, and that deliberate model modifications that improve performance on an invariant action recognition task lead to data representations that better match human neural recordings. Our results support the hypothesis that performance on invariant discrimination dictates the neural representations of actions computed in the brain. These results broaden the scope of the invariant recognition framework for understanding visual intelligence, extending it from the perception of inanimate objects and faces in static images to the study of human perception of action sequences.
