On Human Motion Prediction Using Recurrent Neural Networks

Human motion modelling is a classical problem at the intersection of graphics and computer vision, with applications spanning human-computer interaction, motion synthesis, and motion prediction for virtual and augmented reality. Following the success of deep learning methods in several computer vision tasks, recent work has focused on using deep recurrent neural networks (RNNs) to model human motion, with the goal of learning time-dependent representations that perform tasks such as short-term motion prediction and long-term human motion synthesis. We examine recent work, with a focus on the evaluation methodologies commonly used in the literature, and show that, surprisingly, state of the art performance can be achieved by a simple baseline that does not attempt to model motion at all. We investigate this result, and analyze recent RNN methods by looking at the architectures, loss functions, and training procedures used in state-of-the-art approaches. We propose three changes to the standard RNN models typically used for human motion, which results in a simple and scalable RNN architecture that obtains state-of-the-art performance on human motion prediction.

[1]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  David J. Fleet,et al.  3D People Tracking with Gaussian Process Dynamical Models , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[3]  Jitendra Malik,et al.  Recurrent Network Models for Human Dynamics , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[4]  Mark H. Overmars,et al.  Real Time Animation of Virtual Humans: A Trade‐off Between Naturalness and Control , 2010, Comput. Graph. Forum.

[5]  Okan Arikan,et al.  Interactive motion generation from examples , 2002, ACM Trans. Graph..

[6]  Konrad Paul Kording,et al.  The statistics of natural hand movements , 2008, Experimental Brain Research.

[7]  Jessica K. Hodgins,et al.  Interactive control of avatars animated with human motion data , 2002, SIGGRAPH.

[8]  Alex Graves,et al.  Generating Sequences With Recurrent Neural Networks , 2013, ArXiv.

[9]  Yoshua Bengio,et al.  On the Properties of Neural Machine Translation: Encoder–Decoder Approaches , 2014, SSST@EMNLP.

[10]  Sanja Fidler,et al.  Skip-Thought Vectors , 2015, NIPS.

[11]  Luc Van Gool,et al.  Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities , 2011, NIPS.

[12]  Yuandong Tian,et al.  Simple Baseline for Visual Question Answering , 2015, ArXiv.

[13]  Vladimir Pavlovic,et al.  Learning Switching Linear Models of Human Motion , 2000, NIPS.

[14]  Aaron Hertzmann,et al.  Style machines , 2000, SIGGRAPH 2000.

[15]  Sebastian Nowozin,et al.  A Non-parametric Bayesian Network Prior of Human Pose , 2013, 2013 IEEE International Conference on Computer Vision.

[16]  Michael J. Black,et al.  Parameterized Modeling and Recognition of Activities , 1999, Comput. Vis. Image Underst..

[17]  Hema Swetha Koppula,et al.  Learning Spatio-Temporal Structure from RGB-D Videos for Human Activity Detection and Anticipation , 2013, ICML.

[18]  Cristian Sminchisescu,et al.  Generative modeling for continuous non-linearly embedded visual inference , 2004, ICML.

[19]  David J. Fleet,et al.  Gaussian Process Dynamical Models , 2005, NIPS.

[20]  Samy Bengio,et al.  Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks , 2015, NIPS.

[21]  Rui Li,et al.  Monocular Tracking of 3D Human Motion with a Coordinated Mixture of Factor Analyzers , 2006, ECCV.

[22]  Rama Chellappa,et al.  Rolling Rotations for Recognizing Human Actions from 3D Skeletal Data , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Yang Wang,et al.  Semi-Latent Dirichlet Allocation: A Hierarchical Model for Human Action Recognition , 2007, Workshop on Human Motion.

[24]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[25]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Jason Weston,et al.  Curriculum learning , 2009, ICML '09.

[27]  Rama Chellappa,et al.  Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  N. Troje Decomposing biological motion: a framework for analysis and synthesis of human gait patterns. , 2002, Journal of vision.

[29]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[30]  Yoshua Bengio,et al.  Professor Forcing: A New Algorithm for Training Recurrent Networks , 2016, NIPS.

[31]  David J. Fleet,et al.  Stochastic Tracking of 3D Human Figures Using 2D Image Motion , 2000, ECCV.

[32]  Sebastian Nowozin,et al.  Efficient Nonlinear Markov Models for Human Motion , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Allan Jabri,et al.  Revisiting Visual Question Answering Baselines , 2016, ECCV.

[34]  David J. Fleet,et al.  Physics-Based Person Tracking Using the Anthropomorphic Walker , 2010, International Journal of Computer Vision.

[35]  Marco La Cascia,et al.  Hankelet-based dynamical systems modeling for 3D action recognition , 2015, Image Vis. Comput..

[36]  Sergey Levine,et al.  Continuous character control with low-dimensional embeddings , 2012, ACM Trans. Graph..

[37]  Kate Saenko,et al.  Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering , 2015, ECCV.

[38]  Lucas Kovar,et al.  Motion graphs , 2002, SIGGRAPH '08.

[39]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[40]  Sanjeev Arora,et al.  A Simple but Tough-to-Beat Baseline for Sentence Embeddings , 2017, ICLR.

[41]  Matthias Bethge,et al.  A note on the evaluation of generative models , 2015, ICLR.

[42]  Mario Fritz,et al.  Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[43]  Stephen J. Maybank,et al.  Action classification using a discriminative multilevel HDP-HMM , 2015, Neurocomputing.

[44]  Behrooz Mahasseni,et al.  Regularizing Long Short Term Memory with 3D Human-Skeleton Sequences for Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  James J. Little,et al.  3D Pose from Motion for Cross-View Action Recognition via Non-linear Circulant Temporal Encoding , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Richard S. Zemel,et al.  Exploring Models and Data for Image Question Answering , 2015, NIPS.

[47]  David J. Fleet,et al.  Temporal motion models for monocular and multiview 3D human body tracking , 2006, Comput. Vis. Image Underst..

[48]  Silvio Savarese,et al.  Structural-RNN: Deep Learning on Spatio-Temporal Graphs , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Gang Wang,et al.  Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition , 2016, ECCV.

[50]  Geoffrey E. Hinton,et al.  Modeling Human Motion Using Binary Latent Variables , 2006, NIPS.

[51]  Geoffrey J. Gordon,et al.  A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning , 2010, AISTATS.

[52]  Maja J. Mataric,et al.  A spatio-temporal extension to Isomap nonlinear dimension reduction , 2004, ICML.