Generating Person-Scene Interactions in 3D Scenes

High-fidelity digital 3D environments have been proposed in recent years; however, it remains extremely challenging to automatically populate such environments with realistic human bodies. Existing work uses images, depth maps, or semantic maps to represent the scene, and parametric human models to represent 3D bodies in the scene. While straightforward, these approaches often generate human-scene interactions that lack naturalness and physical plausibility. Our key observation is that humans interact with the world through body-scene contact. Explicitly and effectively representing the physical contact between the body and the world is therefore essential for modeling human-scene interaction. To that end, we propose a novel interaction representation, which explicitly encodes the proximity between the human body and the 3D scene around it. Specifically, given a set of basis points on a scene mesh, we leverage a conditional variational autoencoder to synthesize the distance from every basis point to its closest point on a human body. The synthesized proximal relationship between the human body and the scene indicates which regions a person tends to contact. Furthermore, based on this synthesized proximity, we can effectively obtain expressive 3D human bodies that naturally interact with the 3D scene. Our perceptual study shows that our model significantly improves over the state-of-the-art method, approaching the realism of real human-scene interaction. We believe our method makes an important step towards the fully automatic synthesis of realistic 3D human bodies in 3D scenes. Our code and model will be publicly available for research purposes.
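To make the proximity representation concrete, the following is a minimal sketch of the quantity the abstract describes: the distance from each basis point on the scene to its closest point on a body. It is not the authors' implementation. It assumes the body surface is approximated by its mesh vertices, it does not include the conditional variational autoencoder, and the names `basis_point_distances`, `contact_mask`, and the 2 cm `threshold` are hypothetical choices made for illustration.

```python
import numpy as np


def basis_point_distances(basis_points, body_vertices):
    """Distance from every scene basis point to its nearest body vertex.

    basis_points:  (B, 3) array of points sampled on the scene mesh.
    body_vertices: (V, 3) array of body mesh vertices (assumption: the body
                   surface is approximated by its vertices).
    Returns (distances, nearest_idx), each of shape (B,).
    """
    # Pairwise differences via broadcasting: shape (B, V, 3).
    diffs = basis_points[:, None, :] - body_vertices[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=-1)  # shape (B, V)
    nearest_idx = pairwise.argmin(axis=1)
    distances = pairwise[np.arange(len(basis_points)), nearest_idx]
    return distances, nearest_idx


def contact_mask(distances, threshold=0.02):
    """Mark basis points closer than `threshold` (hypothetical value, in
    scene units) as likely contact regions."""
    return distances < threshold


# Toy data: three basis points on a scene, two body vertices.
basis = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 2.0]])
body = np.array([[0.0, 0.0, 0.01], [0.0, 0.0, 1.0]])

distances, nearest_idx = basis_point_distances(basis, body)
in_contact = contact_mask(distances)
print(distances, nearest_idx, in_contact)
```

In this toy setup only the first basis point lies within the threshold of the body, so it is the only one flagged as a contact region. A generative model in the spirit of the abstract would predict such a distance vector from the scene alone, and a body would then be fit to match it.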
