Paper information - TEMOS: Generating diverse human motions from textual descriptions


We address the problem of generating diverse 3D human motions from textual descriptions. This challenging task requires joint modeling of both modalities: understanding and extracting useful human-centric information from the text, and then generating plausible and realistic sequences of human poses. In contrast to most previous work, which focuses on generating a single, deterministic motion from a textual description, we design a variational approach that can produce multiple diverse human motions. We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data, in combination with a text encoder that produces distribution parameters compatible with the VAE latent space. We show that the TEMOS framework can produce both skeleton-based animations, as in prior work, as well as more expressive SMPL body motions. We evaluate our approach on the KIT Motion-Language benchmark and, despite being relatively straightforward, demonstrate significant improvements over the state of the art. Code and models are available on our project page.
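The idea of a text encoder that outputs distribution parameters compatible with a VAE latent space can be illustrated with a minimal sketch. It is not the authors' implementation: the numeric values stand in for hypothetical encoder outputs, the function names `sample_latent` and `kl_diag_gaussians` are made up for this example, and the use of a KL term to compare the text and motion distributions is an assumption that the abstract does not spell out. The sketch shows only how one text-derived Gaussian can yield several different latent codes, which is what makes diverse generation possible.

```python
import math
import random


def sample_latent(mu, logvar, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def kl_diag_gaussians(mu_p, logvar_p, mu_q, logvar_q):
    """KL(N(mu_p, var_p) || N(mu_q, var_q)) for diagonal covariances."""
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        total += 0.5 * (lq - lp
                        + (math.exp(lp) + (mp - mq) ** 2) / math.exp(lq)
                        - 1.0)
    return total


# Hypothetical encoder outputs (mean and log-variance) for one text/motion pair.
text_mu, text_logvar = [0.2, -0.1, 0.5], [-1.0, -0.5, -2.0]
motion_mu, motion_logvar = [0.1, 0.0, 0.4], [-1.2, -0.4, -1.8]

# Several latent codes drawn from the same text distribution; a decoder
# (not shown) would map each code to a different motion sequence.
rng = random.Random(0)
samples = [sample_latent(text_mu, text_logvar, rng) for _ in range(3)]

# Assumed alignment measure between the two distributions.
kl_text_motion = kl_diag_gaussians(text_mu, text_logvar, motion_mu, motion_logvar)
kl_same = kl_diag_gaussians(text_mu, text_logvar, text_mu, text_logvar)
```

Because both encoders describe Gaussians over the same latent space in this sketch, a single decoder could consume a code sampled from either one; the KL value shrinks to zero as the two sets of parameters coincide.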

Michael J. Black | Gül Varol | Mathis Petrovich
