A Closer Look at Spatiotemporal Convolutions for Action Recognition

In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significantly gains in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block "R(2+1)D" which produces CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101, and HMDB51.

[1]  Gunnar Farnebäck,et al.  Two-Frame Motion Estimation Based on Polynomial Expansion , 2003, SCIA.

[2]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[3]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[4]  Cordelia Schmid,et al.  Human Detection Using Oriented Histograms of Flow and Appearance , 2006, ECCV.

[5]  Horst Bischof,et al.  A Duality Based Approach for Realtime TV-L1 Optical Flow , 2007, DAGM-Symposium.

[6]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[7]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[8]  Yann LeCun,et al.  Convolutional Learning of Spatio-temporal Features , 2010, ECCV.

[9]  Christian Wolf,et al.  Sequential Deep Learning for Human Action Recognition , 2011, HBU.

[10]  Quoc V. Le,et al.  Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis , 2011, CVPR 2011.

[11]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[12]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[13]  Jason J. Corso,et al.  Action bank: A high-level representation of activity in video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[15]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[17]  Тараса Шевченка,et al.  Quo vadis? , 2013, Clinical chemistry.

[18]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[19]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Yi Yang,et al.  A discriminative CNN video representation for event detection , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Lisa Anne Hendricks,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2015, CVPR.

[23]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[24]  Lin Sun,et al.  Human Action Recognition Using Factorized Spatio-Temporal Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[25]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[26]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[28]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Pavlo Molchanov,et al.  Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Shih-Fu Chang,et al.  Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Richard P. Wildes,et al.  Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.

[33]  Tao Mei,et al.  Jointly Modeling Embedding and Translation to Bridge Video and Language , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Ali Farhadi,et al.  Actions ~ Transformations , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[37]  Christopher Joseph Pal,et al.  Delving Deeper into Convolutional Networks for Learning Video Representations , 2015, ICLR.

[38]  Tao Mei,et al.  Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[39]  Kaiming He,et al.  Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.

[40]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Abhinav Gupta,et al.  ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[43]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Cordelia Schmid,et al.  Long-Term Temporal Convolutions for Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.