NAS-TC: Neural Architecture Search on Temporal Convolutions for Complex Action Recognition

In complex action recognition in videos, the design of the model plays a crucial role in final performance. However, manually designed network structures rely heavily on researchers' knowledge and experience. Neural architecture search (NAS), which automates the design of the network structure, has therefore achieved great success in image processing and attracted substantial research attention in recent years. Although some NAS methods have reduced the search cost in the image domain to single-digit GPU days, extending NAS to video by directly using 3D convolution is still likely to cause a surge in computation. To address this challenge, we propose a new framework called Neural Architecture Search-Temporal Convolutional (NAS-TC). The framework is divided into two phases. In the first phase, a classical CNN is used as the backbone network to complete the computationally intensive feature extraction task. In the second phase, a simple stitching search over the cell is used to complete the relatively lightweight extraction of long-range temporal dependencies. This gives our method a more reasonable parameter allocation and allows it to handle minute-level videos. Finally, we conduct extensive experiments on multiple benchmark datasets and obtain competitive recognition accuracy.
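
The cost argument above can be made concrete with a rough count of multiply-accumulate operations (MACs). The following is only an illustrative sketch, not the paper's implementation: the helper `conv_macs`, the channel width of 1024, the 64 time steps, and the 7x7 feature map are all assumed values chosen for the example. It compares one 3x3x3 3D convolution applied over the full spatiotemporal feature volume with one temporal-only convolution of kernel size 3 applied to features that a backbone has already pooled spatially.

```python
def conv_macs(kernel, c_in, c_out, out_positions):
    """Multiply-accumulates for one convolution layer (bias ignored).

    kernel: tuple of kernel sizes, e.g. (3, 3, 3) for 3D or (3,) for 1D.
    out_positions: number of output locations the kernel is applied to.
    """
    kernel_volume = 1
    for size in kernel:
        kernel_volume *= size
    return kernel_volume * c_in * c_out * out_positions


# Assumed (hypothetical) sizes, not taken from the paper.
channels = 1024          # feature channels, in and out
timesteps = 64           # sampled time steps of a long video
height, width = 7, 7     # spatial size of the backbone feature map

# 3D convolution over every (t, h, w) location.
macs_3d = conv_macs((3, 3, 3), channels, channels, timesteps * height * width)

# Temporal-only convolution over spatially pooled features: one location per time step.
macs_temporal = conv_macs((3,), channels, channels, timesteps)

ratio = macs_3d // macs_temporal
print(f"3D conv MACs:       {macs_3d:,}")
print(f"temporal conv MACs: {macs_temporal:,}")
print(f"ratio:              {ratio}x")
```

Under these assumed sizes the 3D layer costs 441 times as many MACs as the temporal-only layer (a factor of 9 from the spatial kernel and 49 from the spatial positions), which is the kind of gap that makes searching only over the temporal stage comparatively cheap.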
