Collaborative Regression of Expressive Bodies using Moderation

Recovering expressive humans from images is essential for understanding human behavior. Methods that estimate 3D bodies, faces, or hands have progressed significantly, yet separately. Face methods recover accurate 3D shape and geometric details, but need a tight crop and struggle with extreme views and low resolution. Whole-body methods are robust to a wide range of poses and resolutions, but provide only a rough 3D face shape without details like wrinkles. To get the best of both worlds, we introduce PIXIE, which produces animatable, whole-body 3D avatars with realistic facial detail, from a single image. For this, PIXIE uses two key observations. First, existing work combines independent estimates from body, face, and hand experts, by trusting them equally. PIXIE introduces a novel moderator that merges the features of the experts, weighted by their confidence. All part experts can contribute to the whole, using SMPL-X’s shared shape space across all body parts. Second, human shape is highly correlated with gender, but existing work ignores this. We label training images as male, female, or non-binary, and train PIXIE to infer “gendered” 3D body shapes with a novel shape loss. In addition to 3D body pose and shape parameters, PIXIE estimates expression, illumination, albedo and 3D facial surface displacements. Quantitative and qualitative evaluation shows that PIXIE estimates more accurate whole-shape and detailed face shape than the state of the art. Models and code are available at pixie.is.tue.mpg.de.

[1]  Takaaki Shiratori,et al.  FrankMocap: Fast Monocular 3D Hand and Body Motion Capture by Regression and Integration , 2020, ArXiv.

[2]  Lourdes Agapito,et al.  Lifting from the Deep: Convolutional 3D Pose Estimation from a Single Image , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Hsi-Jian Lee,et al.  Determination of 3D human body postures from a single view , 1985, Comput. Vis. Graph. Image Process..

[4]  Pavlo Molchanov,et al.  Hand Pose Estimation via Latent 2.5D Heatmap Regression , 2018, ECCV.

[5]  Zhenan Sun,et al.  Learning 3D Human Shape and Pose From Dense Body Parts , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Ankur Agarwal,et al.  Recovering 3D human pose from monocular images , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Yichen Wei,et al.  Compositional Human Pose Regression , 2018, Comput. Vis. Image Underst..

[8]  Jiaolong Yang,et al.  Accurate 3D Face Reconstruction With Weakly-Supervised Learning: From Single Image to Image Set , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[9]  Hao Li,et al.  PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[10]  Davis E. King,et al.  Dlib-ml: A Machine Learning Toolkit , 2009, J. Mach. Learn. Res..

[11]  Pascal Fua,et al.  What Face and Body Shapes Can Tell Us About Height , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[12]  Tal Hassner,et al.  Regressing Robust and Discriminative 3D Morphable Models with a Very Deep Neural Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Thomas Vetter,et al.  Estimating Coloured 3D Face Models from Single Images: An Example Based Approach , 1998, ECCV.

[14]  Iasonas Kokkinos,et al.  Weakly-Supervised Mesh-Convolutional Hand Reconstruction in the Wild , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Michael J. Black,et al.  PARE: Part Attention Regressor for 3D Human Body Estimation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[16]  Christian Theobalt,et al.  GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[17]  Stefanos Zafeiriou,et al.  Single Image 3D Hand Reconstruction with Mesh Convolutions , 2019, BMVC.

[18]  Michael J. Black,et al.  On Self-Contact and Human Pose , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Vincent Leroy,et al.  DOPE: Distillation Of Part Experts for whole-body 3D pose estimation in the wild , 2020, ECCV.

[20]  Stefanos Zafeiriou,et al.  GANFIT: Generative Adversarial Network Fitting for High Fidelity 3D Face Reconstruction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Yebin Liu,et al.  Mask-Pose Cascaded CNN for 2D Hand Pose Estimation From Single Color Image , 2019, IEEE Transactions on Circuits and Systems for Video Technology.

[22]  Yaser Sheikh,et al.  Total Capture: A 3D Deformation Model for Tracking Faces, Hands, and Bodies , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[23]  Lijuan Wang,et al.  End-to-End Human Pose and Mesh Reconstruction with Transformers , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Cordelia Schmid,et al.  Moulding Humans: Non-Parametric 3D Human Shape Estimation From Single Images , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[25]  Marc Pollefeys,et al.  H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Dong Liu,et al.  Deep High-Resolution Representation Learning for Human Pose Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[28]  Kostas Daniilidis,et al.  Convolutional Mesh Regression for Single-Image Human Shape Reconstruction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Vincent Lepetit,et al.  HOnnotate: A Method for 3D Annotation of Hand and Object Poses , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Qiang Li,et al.  End-to-End Hand Mesh Recovery From a Monocular RGB Image , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[31]  Yinghao Huang,et al.  Towards Accurate Marker-Less Human Shape and Pose Estimation over Time , 2017, 2017 International Conference on 3D Vision (3DV).

[32]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[33]  Ruigang Yang,et al.  FaceScape: A Large-Scale High Quality 3D Face Dataset and Detailed Riggable 3D Face Prediction , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Konrad Schindler,et al.  Chained Representation Cycling: Learning to Estimate 3D Human Pose and Shape by Cycling Between Representations , 2020, AAAI.

[35]  Dimitrios Tzionas,et al.  Embodied Hands: Modeling and Capturing Hands and Bodies Together , 2022, ArXiv.

[36]  Wan-Yen Lo,et al.  Accelerating 3D deep learning with PyTorch3D , 2019, SIGGRAPH Asia 2020 Courses.

[37]  Peter V. Gehler,et al.  Neural Body Fitting: Unifying Deep Learning and Model Based Human Pose and Shape Estimation , 2018, 2018 International Conference on 3D Vision (3DV).

[38]  Kathleen M. Robinette,et al.  Civilian American and European Surface Anthropometry Resource (CAESAR), Final Report. Volume 1. Summary , 2002 .

[39]  Thomas Brox,et al.  Learning to Estimate 3D Hand Pose from Single RGB Images , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[40]  Michael J. Black,et al.  Learning to Regress 3D Face Shape and Expression From an Image Without 3D Supervision , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Yi Zhou,et al.  On the Continuity of Rotation Representations in Neural Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Bodo Rosenhahn,et al.  Supplementary Material to: Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera , 2018 .

[43]  Michael J. Black,et al.  OpenDR: An Approximate Differentiable Renderer , 2014, ECCV.

[44]  Georgios Tzimiropoulos,et al.  How Far are We from Solving the 2D & 3D Face Alignment Problem? (and a Dataset of 230,000 3D Facial Landmarks) , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[45]  Yaser Sheikh,et al.  Monocular Total Capture: Posing Face, Body, and Hands in the Wild , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Philip H. S. Torr,et al.  3D Hand Shape and Pose From Images in the Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Michael J. Black,et al.  Detailed, Accurate, Human Shape Estimation from Clothed 3D Scan Sequences , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Yaser Sheikh,et al.  Hand Keypoint Detection in Single Images Using Multiview Bootstrapping , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Takaaki Shiratori,et al.  FrankMocap: A Monocular 3D Whole-Body Pose Estimation System via Regression and Integration , 2021, 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW).

[50]  William T. Freeman,et al.  Unsupervised Training for 3D Morphable Model Regression , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[51]  Michael J. Black,et al.  Pose-conditioned joint angle limits for 3D human pose reconstruction , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Dimitrios Tzionas,et al.  Monocular Expressive Body Regression through Body-Driven Attention , 2020, ECCV.

[53]  Yichen Wei,et al.  Integral Human Pose Regression , 2017, ECCV.

[54]  Michael J. Black,et al.  Estimating human shape and pose from a single image , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[55]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[56]  Peter V. Gehler,et al.  Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image , 2016, ECCV.

[57]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Jitendra Malik,et al.  End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[59]  Tae-Kyun Kim,et al.  Pushing the Envelope for RGB-Based Dense 3D Hand Pose Estimation via Neural Rendering , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Feng Liu,et al.  Towards High-Fidelity Nonlinear 3D Face Morphable Model , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Andrea Vedaldi,et al.  Exemplar Fine-Tuning for 3D Human Model Fitting Towards In-the-Wild 3D Human Pose Estimation , 2020, 2021 International Conference on 3D Vision (3DV).

[62]  Michael J. Black,et al.  Learning to Dress 3D People in Generative Clothing , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[63]  Mark Everingham,et al.  Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation , 2010, BMVC.

[64]  Wanli Ouyang,et al.  3D Human Mesh Regression With Dense Correspondence , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[65]  Trevor Darrell,et al.  Inferring 3D structure with a statistical image-based shape model , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[66]  Robert C. Bolles,et al.  Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography , 1981, CACM.

[67]  Joachim Tesch,et al.  AGORA: Avatars in Geography Optimized for Regression Analysis , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[68]  M. Pollefeys,et al.  Unified Egocentric Recognition of 3 D Hand-Object Poses and Interactions , 2019 .

[69]  James J. Little,et al.  A Simple Yet Effective Baseline for 3d Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[70]  Cordelia Schmid,et al.  Learning Joint Reconstruction of Hands and Manipulated Objects , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[71]  Michael J. Black,et al.  Learning a model of facial shape and expression from 4D scans , 2017, ACM Trans. Graph..

[72]  Tao Yu,et al.  DeepHuman: 3D Human Reconstruction From a Single Image , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[73]  Georgios Tzimiropoulos,et al.  Large Pose 3D Face Reconstruction from a Single Image via Direct Volumetric CNN Regression , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[74]  Iasonas Kokkinos,et al.  HoloPose: Holistic 3D Human Reconstruction In-The-Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[75]  Cordelia Schmid,et al.  BodyNet: Volumetric Inference of 3D Human Body Shapes , 2018, ECCV.

[76]  Jin Xu,et al.  Whole-Body Human Pose Estimation in the Wild , 2020, ECCV.

[77]  Luc Van Gool,et al.  DEX: Deep EXpectation of Apparent Age from a Single Image , 2015, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW).

[78]  Xi Zhou,et al.  Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network , 2018, ECCV.

[79]  Oswald Aldrian,et al.  Inverse Rendering of Faces with a 3D Morphable Model , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[80]  Cheng Li,et al.  Delving Deep Into Hybrid Annotations for 3D Human Recovery in the Wild , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[81]  Cristian Sminchisescu,et al.  Monocular 3D Pose and Shape Estimation of Multiple People in Natural Scenes: The Importance of Multiple Scene Constraints , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[82]  Peter V. Gehler,et al.  Unite the People: Closing the Loop Between 3D and 2D Human Representations , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[83]  Jianzhu Guo,et al.  Towards Fast, Accurate and Stable 3D Dense Face Alignment , 2020, ECCV.

[84]  Dimitrios Tzionas,et al.  Expressive Body Capture: 3D Hands, Face, and Body From a Single Image , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[85]  Hanbyul Joo,et al.  PIFuHD: Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[86]  Francesc Moreno-Noguer,et al.  SMPLicit: Topology-aware Generative Model for Clothed People , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[87]  Cristian Sminchisescu,et al.  Learning Complex 3D Human Self-Contact , 2020, AAAI.

[88]  Michael J. Black,et al.  SMPL: A Skinned Multi-Person Linear Model , 2015, ACM Trans. Graph..

[89]  Michael J. Black,et al.  Predicting 3D People from 2D Pictures , 2006, AMDO.

[90]  Michael J. Black,et al.  Learning an animatable detailed 3D face model from in-the-wild images , 2020, ArXiv.

[91]  Jianfei Cai,et al.  3D Hand Shape and Pose Estimation from a Single RGB Image (Supplementary Material) , 2019 .

[92]  William A. P. Smith,et al.  Fitting a 3D Morphable Model to Edges: A Comparison Between Hard and Soft Correspondences , 2016, ACCV Workshops.

[93]  Jitendra Malik,et al.  Learning 3D Human Dynamics From Video , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[94]  Cristian Sminchisescu,et al.  GHUM & GHUML: Generative 3D Human Shape and Articulated Pose Models , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[95]  Cristian Sminchisescu,et al.  Three-Dimensional Reconstruction of Human Interactions , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[96]  Omkar M. Parkhi,et al.  VGGFace2: A Dataset for Recognising Faces across Pose and Age , 2017, 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018).

[97]  M. Zollhöfer,et al.  Self-Supervised Multi-level Face Model Learning for Monocular Reconstruction at Over 250 Hz , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[98]  Thomas Brox,et al.  FreiHAND: A Dataset for Markerless Capture of Hand Pose and Shape From Single RGB Images , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[99]  Tal Hassner,et al.  On Face Segmentation, Face Swapping, and Face Perception , 2017, 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018).

[100]  Thabo Beeler,et al.  3D Morphable Face Models—Past, Present, and Future , 2020, ACM Trans. Graph..

[101]  Michael J. Black,et al.  Monocular, One-stage, Regression of Multiple 3D People , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[102]  Patrick Pérez,et al.  MoFA: Model-Based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[103]  Prediction of 3D Body Parts from Face Shape and Anthropometric Measurements , 2020, Journal of Image and Graphics.

[104]  Yu Tian,et al.  Semantic Graph Convolutional Networks for 3D Human Pose Regression , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[105]  Michael J. Black,et al.  Learning to Reconstruct 3D Human Pose and Shape via Model-Fitting in the Loop , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[106]  Vincent Lepetit,et al.  Structured Prediction of 3D Human Pose with Deep Neural Networks , 2016, BMVC.

[107]  Xiaowei Zhou,et al.  Learning to Estimate 3D Human Pose and Shape from a Single Color Image , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[108]  Xiaochen Hu,et al.  FACSIMILE: Fast and Accurate Scans From an Image in Less Than a Second , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[109]  Antonio Torralba,et al.  Face-to-BMI: Using Computer Vision to Infer Body Mass Index on Social Media , 2017, ICWSM.

[110]  Xiaowei Zhou,et al.  Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[111]  Christian Theobalt,et al.  Monocular Real-time Full Body Capture with Inter-part Correlations , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).