Paper Information - Learning Joint Reconstruction of Hands and Manipulated Objects

Learning Joint Reconstruction of Hands and Manipulated Objects

Estimating hand-object manipulations is essential for interpreting and imitating human actions. Previous work has made significant progress towards reconstruction of hand poses and object shapes in isolation. Yet, reconstructing hands and objects during manipulation is a more challenging task due to significant occlusions of both the hand and object. While presenting challenges, manipulations may also simplify the problem since the physics of contact restricts the space of valid hand-object configurations. For example, during manipulation, the hand and object should be in contact but not interpenetrate. In this work, we regularize the joint reconstruction of hands and objects with manipulation constraints. We present an end-to-end learnable model that exploits a novel contact loss that favors physically plausible hand-object constellations. Our approach improves grasp quality metrics over baselines, using RGB images as input. To train and evaluate the model, we also propose a new large-scale synthetic dataset, ObMan, with hand-object manipulations. We demonstrate the transferability of ObMan-trained models to real data.
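The constraint stated in the abstract (hand and object in contact, but not interpenetrating) can be illustrated with a toy penalty. The sketch below is not the paper's contact loss. It assumes the object is a sphere so that signed distance has a closed form, and the names `contact_penalty`, `attract_range`, and `w_repulse` are hypothetical. Hand points inside the object are pushed out (repulsion), and points hovering just outside the surface are pulled onto it (attraction).

```python
import math


def signed_distance_to_sphere(point, center, radius):
    """Negative inside the sphere, zero on its surface, positive outside."""
    return math.dist(point, center) - radius


def contact_penalty(hand_points, center, radius,
                    attract_range=0.02, w_repulse=0.5):
    """Toy contact penalty for hand points against a spherical object.

    Assumptions (illustrative only, not taken from the paper):
    - the object is a sphere given by `center` and `radius`;
    - points deeper inside the object are penalised linearly (repulsion);
    - points outside but within `attract_range` of the surface are
      penalised by their distance to it (attraction);
    - points farther away contribute nothing.
    """
    repulsion = 0.0
    attraction = 0.0
    for p in hand_points:
        d = signed_distance_to_sphere(p, center, radius)
        if d < 0.0:
            repulsion += -d          # penetration depth
        elif d <= attract_range:
            attraction += d          # gap to the surface
    n = max(len(hand_points), 1)     # avoid division by zero on empty input
    return (w_repulse * repulsion + (1.0 - w_repulse) * attraction) / n


if __name__ == "__main__":
    center, radius = (0.0, 0.0, 0.0), 0.05
    fingertips = [(0.05, 0.0, 0.0),   # touching the surface
                  (0.04, 0.0, 0.0),   # 1 cm inside the object
                  (0.06, 0.0, 0.0)]   # 1 cm outside, within range
    print(contact_penalty(fingertips, center, radius))
```

The penalty is zero exactly when every nearby hand point lies on the object surface, which is the "in contact but not interpenetrating" configuration. A learned model would need a differentiable version of this computed on predicted meshes rather than on an analytic sphere.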
