TrajectoryNet: a new spatio-temporal feature learning network for human motion prediction.

Human motion prediction is a topic of growing interest in computer vision and robotics. In this paper, we propose a new 2D CNN-based network, TrajectoryNet, to predict future poses in the trajectory space. In contrast to most existing methods, our model focuses on modeling the motion dynamics with three kinds of features of the previous pose sequence: coupled spatio-temporal features, local-global spatial features, and global temporal co-occurrence features. Specifically, the coupled spatio-temporal features describe the spatial and temporal structure hidden in natural human motion sequences, and are mined by letting the convolutional filters cover both the space and time dimensions of the input pose sequence. The local-global spatial features encode the different correlations between joints of the human body (e.g. strong correlations between joints of one limb, weak correlations between joints of different limbs); they are captured hierarchically by enlarging the receptive field layer by layer and by adding residual connections from the lower layers to the deeper layers of our proposed convolutional network. The global temporal co-occurrence features represent the co-occurrence relationship in which different subsequences of a complex motion sequence appear simultaneously; TrajectoryNet obtains them automatically by reorganizing the temporal information as the depth dimension of the input tensor. Finally, future poses are approximated based on the captured motion dynamics features. Extensive experiments show that our method achieves state-of-the-art performance on three challenging benchmarks (Human3.6M, G3D, and FNTU), which demonstrates the effectiveness of our proposed method. The code will be made available if the paper is accepted.
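
As an illustration of the idea of treating time as the depth dimension, the following is a minimal NumPy sketch and not the TrajectoryNet implementation. The function names, the 17-joint skeleton, the 10 input and 25 output frames, and the 3x3 kernel are all assumptions made for the example. A pose sequence of shape (frames, joints, coordinates) is rearranged so that each (joint, coordinate) position carries its whole trajectory as channels; a single 2D convolution then mixes neighbouring joints (space) and all input frames (time) in one filter.

```python
import numpy as np


def to_trajectory_tensor(poses):
    """Rearrange (frames, joints, coords) into (joints, coords, frames).

    After this step the frame axis plays the role of the channel (depth)
    axis of a 2D convolution. Hypothetical helper, for illustration only.
    """
    poses = np.asarray(poses, dtype=np.float64)
    if poses.ndim != 3:
        raise ValueError("expected an array of shape (frames, joints, coords)")
    return np.transpose(poses, (1, 2, 0))


def conv2d_same(x, weights):
    """Naive zero-padded 2D cross-correlation with odd kernel sizes.

    x:       (H, W, C_in)
    weights: (kh, kw, C_in, C_out)
    returns: (H, W, C_out)
    """
    kh, kw, c_in, c_out = weights.shape
    h, w, c = x.shape
    if c != c_in:
        raise ValueError("channel mismatch between input and weights")
    ph, pw = kh // 2, kw // 2
    padded = np.pad(x, ((ph, ph), (pw, pw), (0, 0)))
    out = np.zeros((h, w, c_out))
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + kh, j:j + kw, :]
            # Each output channel sums over the spatial window AND all frames.
            out[i, j] = np.tensordot(patch, weights, axes=([0, 1, 2], [0, 1, 2]))
    return out


# Assumed sizes: 10 observed frames, 17 joints, 3D coordinates, 25 output frames.
rng = np.random.default_rng(0)
poses = rng.standard_normal((10, 17, 3))
x = to_trajectory_tensor(poses)               # (17, 3, 10)
weights = rng.standard_normal((3, 3, 10, 25)) * 0.01
y = conv2d_same(x, weights)                   # (17, 3, 25)
future = np.transpose(y, (2, 0, 1))           # back to (frames, joints, coords)
print(x.shape, y.shape, future.shape)
```

Because every filter spans all input frames at once, each output value depends on the whole observed trajectory, which is one way to read the "global temporal" behaviour described above. Stacking several such layers would enlarge the receptive field over the joint axis layer by layer.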
