Populating 3D Scenes by Learning Human-Scene Interaction

Humans live within a 3D space and constantly interact with it to perform tasks. Such interactions involve physical contact between surfaces that is semantically meaningful. Our goal is to learn how humans interact with scenes and leverage this to enable virtual characters to do the same. To that end, we introduce a novel Human-Scene Interaction (HSI) model that encodes proximal relationships, called POSA for “Pose with prOximitieS and contActs”. The representation of interaction is body-centric, which enables it to generalize to new scenes. Specifically, POSA augments the SMPL-X parametric human body model such that, for every mesh vertex, it encodes (a) the contact probability with the scene surface and (b) the corresponding semantic scene label. We learn POSA with a VAE conditioned on the SMPL-X vertices, and train on the PROX dataset, which contains SMPL-X meshes of people interacting with 3D scenes, and the corresponding scene semantics from the PROX-E dataset. We demonstrate the value of POSA with two applications. First, we automatically place 3D scans of people in scenes. We use a SMPL-X model fit to the scan as a proxy and then find its most likely placement in 3D. POSA provides an effective representation to search for “affordances” in the scene that match the likely contact relationships for that pose. We perform a perceptual study that shows significant improvement over the state of the art on this task. Second, we show that POSA’s learned representation of body-scene interaction supports monocular human pose estimation that is consistent with a 3D scene, improving on the state of the art. Our model and code will be available for research purposes at https://posa.is.tue.mpg.de.

[1]  Dimitrios Tzionas,et al.  Embodied Hands: Modeling and Capturing Hands and Bodies Together , 2022, ArXiv.

[2]  Kostas Daniilidis,et al.  Convolutional Mesh Regression for Single-Image Human Shape Reconstruction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Danica Kragic,et al.  Visual object-action recognition: Inferring object affordances from human demonstration , 2011, Comput. Vis. Image Underst..

[4]  Michael J. Black,et al.  Generating 3D faces using Convolutional Mesh Autoencoders , 2018, ECCV.

[5]  Cordelia Schmid,et al.  Learning Joint Reconstruction of Hands and Manipulated Objects , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Xiao Wang,et al.  Indirect shape analysis for 3D shape retrieval , 2015, Comput. Graph..

[7]  Yun Jiang,et al.  Hallucinated Humans as the Hidden Context for Labeling 3D Scenes , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Dimitrios Tzionas,et al.  Expressive Body Capture: 3D Hands, Face, and Body From a Single Image , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Matthias Nießner,et al.  State of the Art on 3D Reconstruction with RGB‐D Cameras , 2018, Comput. Graph. Forum.

[10]  Christopher Burgess,et al.  beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework , 2016, ICLR 2016.

[11]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[12]  Jitendra Malik,et al.  Gibson Env: Real-World Perception for Embodied Agents , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[13]  Michael Gleicher,et al.  Retargetting motion to new characters , 1998, SIGGRAPH.

[14]  Ying He,et al.  A Sketching Interface for Sitting Pose Design in the Virtual Environment , 2012, IEEE Transactions on Visualization and Computer Graphics.

[15]  Minh Vo,et al.  Long-term Human Motion Prediction with Scene Context , 2020, ECCV.

[16]  Nikolaus F. Troje,et al.  AMASS: Archive of Motion Capture As Surface Shapes , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[17]  Leonidas J. Guibas,et al.  Learning a Generative Model for Multi‐Step Human‐Object Interactions from Videos , 2019, Comput. Graph. Forum.

[18]  Chenfanfu Jiang,et al.  Inferring Forces and Learning Human Utilities from Videos , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Charles C. Kemp,et al.  ContactPose: A Dataset of Grasps with Object Contact and Hand Pose , 2020, ECCV.

[20]  Leonidas J. Guibas,et al.  ShapeNet: An Information-Rich 3D Model Repository , 2015, ArXiv.

[21]  Przemyslaw Musialski,et al.  Pose to Seat: Automated Design of Body-Supporting Surfaces , 2020, Comput. Aided Geom. Des..

[22]  Leonidas J. Guibas,et al.  ObjectNet3D: A Large Scale Database for 3D Object Recognition , 2016, ECCV.

[23]  Stefanos Zafeiriou,et al.  SpiralNet++: A Fast and Highly Efficient Mesh Convolution Operator , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[24]  Alexei A. Efros,et al.  From 3D scene geometry to human workspace , 2011, CVPR 2011.

[25]  Deva Ramanan,et al.  Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild , 2020, ECCV.

[26]  Michael J. Black,et al.  Dynamic FAUST: Registering Human Bodies in Motion , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Vladlen Koltun,et al.  A Large Dataset of Object Scans , 2016, ArXiv.

[28]  Leonidas J. Guibas,et al.  Understanding and Exploiting Object Interaction Landscapes , 2016, ACM Trans. Graph..

[29]  Stefanos Zafeiriou,et al.  Neural 3D Morphable Models: Spiral Convolutional Networks for 3D Shape Representation Learning and Generation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[30]  Francesc Moreno-Noguer,et al.  GanHand: Predicting Human Grasp Affordances in Multi-Object Scenes , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Jehee Lee,et al.  Motion patches: buildings blocks for virtual environments annotated with motion data , 2005, SIGGRAPH Sketches.

[32]  Jehee Lee,et al.  Motion patches: building blocks for virtual environments annotated with motion data , 2006, ACM Trans. Graph..

[33]  Siddhartha S. Srinivasa,et al.  Benchmarking in Manipulation Research: Using the Yale-CMU-Berkeley Object and Model Set , 2015, IEEE Robotics & Automation Magazine.

[34]  Francesc Moreno-Noguer,et al.  Context-Aware Human Motion Prediction , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Song-Chun Zhu,et al.  Holistic++ Scene Understanding: Single-View 3D Holistic Scene Parsing and Human Pose Estimation With Human-Object Interaction and Physical Commonsense , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[36]  Michael J. Black,et al.  SMPL: A Skinned Multi-Person Linear Model , 2015, ACM Trans. Graph..

[37]  Edmond S. L. Ho,et al.  Spatial relationship preserving character motion adaptation , 2010, ACM Trans. Graph..

[38]  Sung-Hee Lee,et al.  Environment-adaptive contact poses for virtual characters , 2014, SIGGRAPH '14.

[39]  Adrian Hilton,et al.  A survey of advances in vision-based human motion capture and analysis , 2006, Comput. Vis. Image Underst..

[40]  Dieter Fox,et al.  6-DOF GraspNet: Variational Grasp Generation for Object Manipulation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[41]  Michael J. Black,et al.  STAR: Sparse Trained Articulated Human Body Regressor , 2020, ECCV.

[42]  Yan Zhang,et al.  Generating 3D People in Scenes Without People , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Thomas A. Funkhouser,et al.  Semantic Scene Completion from a Single Depth Image , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Cristian Sminchisescu,et al.  GHUM & GHUML: Generative 3D Human Shape and Articulated Pose Models , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Niloy J. Mitra,et al.  Ergonomics-Inspired Reshaping and Exploration of Collections of Models , 2016, IEEE Transactions on Visualization and Computer Graphics.

[46]  Matthias Nießner,et al.  Activity-centric scene synthesis for functional 3D scene modeling , 2015, ACM Trans. Graph..

[47]  Matthias Nießner,et al.  Matterport3D: Learning from RGB-D Data in Indoor Environments , 2017, 2017 International Conference on 3D Vision (3DV).

[48]  Dragomir Anguelov,et al.  SCAPE: shape completion and animation of people , 2005, ACM Trans. Graph..

[49]  Yan Zhang,et al.  PLACE: Proximity Learning of Articulation and Contact in 3D Environments , 2020, 2020 International Conference on 3D Vision (3DV).

[50]  Luc Van Gool,et al.  What makes a chair a chair? , 2011, CVPR 2011.

[51]  Abhinav Gupta,et al.  Binge Watching: Scaling Affordance Learning from Sitcoms , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Dimitrios Tzionas,et al.  GRAB: A Dataset of Whole-Body Human Grasping of Objects , 2020, ECCV.

[53]  HiltonAdrian,et al.  A survey of advances in vision-based human motion capture and analysis , 2006 .

[54]  Christoph Lassner,et al.  Efficient Learning on Point Clouds With Basis Point Sets , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[55]  Ioannis A. Kakadiaris,et al.  3D Human pose estimation: A review of the literature and analysis of covariates , 2016, Comput. Vis. Image Underst..

[56]  Ehud Rivlin,et al.  Functional 3D Object Classification Using Simulation of Embodied Agent , 2006, BMVC.

[57]  Joachim Tesch,et al.  AGORA: Avatars in Geography Optimized for Regression Analysis , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Jan Kautz,et al.  Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Silvio Savarese,et al.  Joint 2D-3D-Semantic Data for Indoor Scene Understanding , 2017, ArXiv.

[60]  Taku Komura,et al.  Relationship descriptors for interactive motion adaptation , 2013, SCA '13.

[61]  Dimitrios Tzionas,et al.  Resolving 3D Human Pose Ambiguities With 3D Scene Constraints , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[62]  Matthias Nießner,et al.  PiGraphs: learning interaction snapshots from observations , 2016, ACM Trans. Graph..

[63]  Pat Hanrahan,et al.  SceneGrok: inferring action maps in 3D environments , 2014, ACM Trans. Graph..

[64]  Michael Goesele,et al.  The Replica Dataset: A Digital Replica of Indoor Spaces , 2019, ArXiv.

[65]  Taku Komura,et al.  Spatial relationship preserving character motion adaptation , 2010, ACM Trans. Graph..

[66]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[67]  Yaser Sheikh,et al.  Total Capture: A 3D Deformation Model for Tracking Faces, Hands, and Bodies , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[68]  Matthias Nießner,et al.  ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[69]  Michael J. Black,et al.  HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion , 2010, International Journal of Computer Vision.