Spatio-temporal convolutional networks explain neural representations of human actions

The ability to recognize the actions of others is a core component of human visual intelligence. Here we investigate the computational mechanisms that allow the human visual system to recognize actions. We use a novel dataset of well-controlled naturalistic videos of five actions performed by five actors at five viewpoints, and we extend a class of biologically inspired hierarchical computational models of object recognition to recognize actions from videos. We explore a number of variations within the class of convolutional neural networks, assessing each model's performance on a viewpoint-invariant action recognition task as well as how closely it matches human neural activity measured with magnetoencephalography (MEG). We show that feed-forward spatio-temporal convolutional neural networks perform well on invariant action recognition tasks and account for the majority of the explainable variance in the neural data. We expand recent advances in comparing artificial systems with neural recordings to explore the importance of specific computational properties of each model, such as invariance to complex transformations. Our analysis helps us understand how visual cortex is organized and provides evidence as to why this organization has prevailed. We show that model features that improve performance on viewpoint-invariant action recognition lead to model representations that better match human neural data. Our results show that spatio-temporal convolutional networks are a good model of how the human visual system solves action recognition, and that robustness to complex transformations, such as 3D viewpoint invariance, is a specific computational goal driving the organization of visual processing in the human brain.

Introduction

Humans’ ability to recognize the actions of others is a crucial aspect of visual perception. The accuracy with which we discern the actions of others is largely unaffected by transformations that, while substantially changing the visual appearance of a scene, do not change the semantics of what we observe (e.g., discriminating between walking and running seen from two different viewpoints). Here we investigate the computational- and algorithmic-level properties of the neural representations that support our ability to recognize actions robustly under such complex transformations. Throughout this paper we use the term action, borrowing from the established taxonomy [1], to mean the middle ground between action primitives (e.g., raise the left foot and move it forward) and activities (e.g., playing basketball). Actions are possibly cyclical sequences of primitives, such as walking or running.

A number of computer vision approaches have been proposed to extract action information from videos. These methods can be organized along a dimension of space-time locality. At one end of this spectrum, global approaches rely on fitting the scene at hand to a joint-based model of the human body and describing actions as sequences of joint configurations over time [2], or on descriptors of the entire scene in space and time [3]–[5]. Local approaches, on the other hand, describe a scene in a bottom-up fashion by detecting the presence of features that are local in space and time; these local descriptors are then combined into more complex representations in a hierarchical manner [6]–[8]. Convolutional neural networks fall squarely within the space-time-local approaches and have prevailed as the best-performing methods on action recognition tasks [9], [10]; a minimal sketch of such a network follows.
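To make the space-time-local approach concrete, below is a minimal sketch, in PyTorch, of a feed-forward spatio-temporal convolutional network: 3D convolutions whose kernels span frames as well as pixels, pooling stages that build tolerance to small shifts in space and time, and a linear readout over the five action classes. The kernel shapes, channel counts, and input resolution are illustrative assumptions, not the architecture evaluated in this paper.

```python
# Minimal spatio-temporal CNN sketch (illustrative architecture, not the
# one evaluated in the paper): 3D convolutions over (frames, height, width).
import torch
import torch.nn as nn

class SpatioTemporalCNN(nn.Module):
    def __init__(self, n_actions: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            # Space-time-local filters: each kernel spans 5 frames and 7x7 pixels.
            nn.Conv3d(1, 16, kernel_size=(5, 7, 7), padding=(2, 3, 3)),
            nn.ReLU(),
            # Pooling builds tolerance to small shifts in time and space.
            nn.MaxPool3d(kernel_size=2),
            nn.Conv3d(16, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            # Global pooling discards absolute position in space and time.
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(32, n_actions)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, frames, height, width)
        return self.classifier(self.features(clip).flatten(1))

# A batch of two 16-frame grayscale clips at 64x64 resolution.
model = SpatioTemporalCNN()
logits = model(torch.randn(2, 1, 16, 64, 64))
print(logits.shape)  # torch.Size([2, 5])
```

Stacking such convolution-pooling stages yields features that are increasingly complex and increasingly tolerant to transformations, mirroring the simple-to-complex organization that motivated this class of models.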
The basic architecture of these artificial systems is loosely inspired by the organization of visual cortex. The recent success of convolutional neural networks in a wide variety of aspects of perception, including object and face recognition [11]–[13], together with their close relation to visual cortex, has inspired the development of methods to match and compare the representations encoded in non-invasive brain imaging and neurophysiology data to those produced by this class of artificial systems. Comparing neural data with computational models of perception using representational similarity analysis (RSA) [14] has provided precise computational accounts of the visual representations underlying invariant object recognition. This line of work has revealed that optimizing the parameters of a convolutional neural network for performance on simple discrimination tasks (e.g., object recognition) results in models that produce representations matching neural recordings in humans and monkeys [15]–[19]. Here we utilize and expand these methods to compare neural representations of actions, measured with MEG, to the representations produced by spatio-temporal convolutional networks.
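As a sketch of how such model-to-brain comparisons proceed, the snippet below builds a representational dissimilarity matrix (RDM) from model features and another from MEG sensor patterns over a common set of video conditions, then rank-correlates the two. The array shapes, the correlation-distance dissimilarity, and the Spearman comparison are standard choices in the RSA literature; the data here are random placeholders, not values from this study.

```python
# Minimal RSA sketch: correlate a model RDM with an MEG RDM.
# All data below are random placeholders, not values from this study.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(responses):
    """responses: (n_conditions, n_features); one row per video condition.
    Returns the condensed upper triangle of the RDM, where dissimilarity
    between two conditions is 1 - Pearson correlation of their patterns."""
    return pdist(responses, metric="correlation")

rng = np.random.default_rng(0)
n_conditions = 25                                           # e.g. 5 actions x 5 viewpoints
model_features = rng.standard_normal((n_conditions, 512))   # one CNN layer's activations
meg_patterns = rng.standard_normal((n_conditions, 306))     # MEG sensors at one time point

# Model-brain match: Spearman rank correlation between the two RDMs.
rho, p = spearmanr(rdm(model_features), rdm(meg_patterns))
print(f"model-MEG RDM correlation: rho={rho:.3f} (p={p:.3g})")
```

A per-layer, per-time-point version of this comparison is what allows one to ask which model stages best account for the neural data at each moment of processing.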

References

[1] Lorenzo Rosasco et al., "Unsupervised learning of invariant representations," Theor. Comput. Sci., 2016.
[2] Ronen Basri et al., "Actions as Space-Time Shapes," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.
[3] Andrew Zisserman et al., "Very Deep Convolutional Networks for Large-Scale Image Recognition," ICLR, 2014.
[4] Massimo Piccardi et al., "Background subtraction techniques: a review," 2004 IEEE International Conference on Systems, Man and Cybernetics, 2004.
[5] Antonio Torralba et al., "Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence," Scientific Reports, 2016.
[6] Joel Z. Leibo et al., "The dynamics of invariant object recognition in the human visual system," Journal of Neurophysiology, 2014.
[7] Tomaso Poggio et al., "A fast, invariant representation for human action in the visual system," Journal of Neurophysiology, 2016.
[8] Tomaso Poggio et al., "CNS: a GPU-based framework for simulating cortically-organized networks," 2010.
[9] H. Bülthoff et al., "Effects of temporal association on recognition memory," Proceedings of the National Academy of Sciences of the United States of America, 2001.
[10] Thomas Serre et al., "A Biologically Inspired System for Action Recognition," 2007 IEEE 11th International Conference on Computer Vision, 2007.
[11] J. Gallant et al., "Identifying natural images from human brain activity," Nature, 2008.
[12] Nikolaus Kriegeskorte et al., "Representational Similarity Analysis – Connecting the Branches of Systems Neuroscience," Frontiers in Systems Neuroscience, 2008.
[13] Sergey Ioffe et al., "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," ICML, 2015.
[14] Joshua B. Tenenbaum et al., "Efficient analysis-by-synthesis in vision: A computational framework, behavioral tests, and modeling neuronal representations," Annual Meeting of the Cognitive Science Society, 2015.
[15] Keiji Tanaka et al., "Matching Categorical Object Representations in Inferior Temporal Cortex of Man and Monkey," Neuron, 2008.
[16] Nikolaus Kriegeskorte et al., "Deep Supervised, but Not Unsupervised, Models May Explain IT Cortical Representation," PLoS Comput. Biol., 2014.
[17] James W. Davis et al., "The Recognition of Human Movement Using Temporal Templates," IEEE Trans. Pattern Anal. Mach. Intell., 2001.
[18] Ming Yang et al., "3D Convolutional Neural Networks for Human Action Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
[19] Eero P. Simoncelli et al., "How MT cells analyze the motion of visual patterns," Nature Neuroscience, 2006.
[20] D. Hubel et al., "Receptive fields, binocular interaction and functional architecture in the cat's visual cortex," The Journal of Physiology, 1962.
[21] Lorenzo Rosasco et al., "GURLS: a least squares library for supervised learning," J. Mach. Learn. Res., 2013.
[22] Edmund T. Rolls et al., "Learning invariant object recognition in the visual system with continuous transformations," Biological Cybernetics, 2006.
[23] Eero P. Simoncelli et al., "Spatiotemporal Elements of Macaque V1 Receptive Fields," Neuron, 2005.
[24] Jimmy Ba et al., "Adam: A Method for Stochastic Optimization," ICLR, 2014.
[25] Lawrence D. Jackel et al., "Backpropagation Applied to Handwritten Zip Code Recognition," Neural Computation, 1989.
[26] Thomas B. Moeslund et al., "A Survey of Computer Vision-Based Human Motion Capture," Comput. Vis. Image Underst., 2001.
[27] Terrence J. Sejnowski et al., "Slow Feature Analysis: Unsupervised Learning of Invariances," Neural Computation, 2002.
[28] Kunihiko Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biological Cybernetics, 1980.
[29] E. H. Adelson et al., "Spatiotemporal energy models for the perception of motion," Journal of the Optical Society of America A, 1985.
[30] Geoffrey E. Hinton et al., "ImageNet classification with deep convolutional neural networks," Commun. ACM, 2012.
[31] G. Johansson, "Visual perception of biological motion and a model for its analysis," 1973.
[32] Eero P. Simoncelli et al., "A model of neuronal responses in visual area MT," Vision Research, 1998.
[33] Fei-Fei Li et al., "Large-Scale Video Classification with Convolutional Neural Networks," 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[34] Rémi Ronfard et al., "Free viewpoint action recognition using motion history volumes," Comput. Vis. Image Underst., 2006.
[35] J. DiCarlo et al., "Using goal-driven deep learning models to understand sensory cortex," Nature Neuroscience, 2016.
[36] Thomas Serre et al., "A feedforward architecture accounts for rapid categorization," Proceedings of the National Academy of Sciences, 2007.
[37] Peter Földiák, "Learning Invariance from Transformation Sequences," Neural Computation, 1991.
[38] Ivan Laptev et al., "On Space-Time Interest Points," International Journal of Computer Vision, 2005.
[39] Serge J. Belongie et al., "Behavior recognition via sparse spatio-temporal features," 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.
[40] Joel Z. Leibo et al., "How can cells in the anterior medial face patch be viewpoint invariant?," 2011.
[41] Radoslaw Martin Cichy et al., "Resolving human object recognition in space and time," Nature Neuroscience, 2014.
[42] Yoshua Bengio et al., "Maxout Networks," ICML, 2013.