Three-D Safari: Learning to Estimate Zebra Pose, Shape, and Texture From Images “In the Wild”

We present the first method to perform automatic 3D pose, shape and texture capture of animals from images acquired in-the-wild. In particular, we focus on the problem of capturing 3D information about Grevy's zebras from a collection of images. The Grevy's zebra is one of the most endangered species in Africa, with only a few thousand individuals left. Capturing the shape and pose of these animals can provide biologists and conservationists with information about animal health and behavior. In contrast to research on human pose, shape and texture estimation, training data for endangered species is limited, the animals are in complex natural scenes with occlusion, they are naturally camouflaged, travel in herds, and look similar to each other. To overcome these challenges, we integrate the recent SMAL animal model into a network-based regression pipeline, which we train end-to-end on synthetically generated images with pose, shape, and background variation. Going beyond state-of-the-art methods for human shape and pose estimation, our method learns a shape space for zebras during training. Learning such a shape space from images using only a photometric loss is novel, and the approach can be used to learn shape in other settings with limited 3D supervision. Moreover, we couple 3D pose and shape prediction with the task of texture synthesis, obtaining a full texture map of the animal from a single image. We show that the predicted texture map allows a novel per-instance unsupervised optimization over the network features. This method, SMALST (SMAL with learned Shape and Texture) goes beyond previous work, which assumed manual keypoints and/or segmentation, to regress directly from pixels to 3D animal shape, pose and texture.

[1]  Zoran Popovic,et al.  The space of human body shapes: reconstruction and parameterization from range scans , 2003, ACM Trans. Graph..

[2]  Sebastian Thrun,et al.  SCAPE: shape completion and animation of people , 2005, SIGGRAPH 2005.

[3]  Dragomir Anguelov,et al.  SCAPE: shape completion and animation of people , 2005, ACM Trans. Graph..

[4]  Lourdes Agapito,et al.  Balloon Shapes: Reconstructing and Deforming Objects with Volume from Images , 2013, 2013 International Conference on 3D Vision.

[5]  Andrew W. Fitzgibbon,et al.  What Shape Are Dolphins? Building 3D Morphable Models from 2D Images , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[7]  Michael J. Black,et al.  OpenDR: An Approximate Differentiable Renderer , 2014, ECCV.

[8]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Michael J. Black,et al.  SMPL: A Skinned Multi-Person Linear Model , 2015, ACM Trans. Graph..

[10]  Fiora Pirri,et al.  Component-Wise Modeling of Articulated Objects , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[11]  Michael J. Black,et al.  Detailed Full-Body Reconstructions of Moving People from Monocular RGB-D Sequences , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[12]  Bernhard Egger,et al.  Background modeling for generative image models , 2015, Comput. Vis. Image Underst..

[13]  Hans-Peter Seidel,et al.  Animated 3D Creatures from Single-view Video by Skeletal Sketching , 2016, Graphics Interface.

[14]  Ronen Basri,et al.  Learning 3D Deformation of Animals from 2D Images , 2015, Comput. Graph. Forum.

[15]  Andreas Geiger,et al.  Augmented Reality meets Deep Learning , 2017, BMVC.

[16]  Cordelia Schmid,et al.  Learning from Synthetic Humans , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Michael J. Black,et al.  3D Menagerie: Modeling the 3D Shape and Pose of Animals , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Ignas Budvytis,et al.  Indirect deep structured learning for 3D human body shape and pose prediction , 2017, BMVC.

[19]  Peter V. Gehler,et al.  Unite the People: Closing the Loop Between 3D and 2D Human Representations , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  René Vidal,et al.  3D Pose Regression Using Convolutional Neural Networks , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[21]  Ersin Yumer,et al.  Self-supervised Learning of Motion Capture , 2017, NIPS.

[22]  Jiajun Wu,et al.  Self-Supervised Intrinsic Image Decomposition , 2017, NIPS.

[23]  Kaiming He,et al.  Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[24]  Xiaowei Zhou,et al.  Learning to Estimate 3D Human Pose and Shape from a Single Color Image , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[25]  Andrew W. Fitzgibbon,et al.  Creatures great and SMAL: Recovering the shape and motion of animals from video , 2018, ACCV.

[26]  Jitendra Malik,et al.  Learning Category-Specific Mesh Reconstruction from Image Collections , 2018, ECCV.

[27]  Marcus A. Magnor,et al.  Video Based Reconstruction of 3D People Models , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[28]  Tatsuya Harada,et al.  Neural 3D Mesh Renderer , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29]  Kevin M. Cury,et al.  DeepLabCut: markerless pose estimation of user-defined body parts with deep learning , 2018, Nature Neuroscience.

[30]  Bodo Rosenhahn,et al.  Supplementary Material to: Recovering Accurate 3D Human Pose in The Wild Using IMUs and a Moving Camera , 2018 .

[31]  Alexei A. Efros,et al.  The Unreasonable Effectiveness of Deep Features as a Perceptual Metric , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Michael J. Black,et al.  Lions and Tigers and Bears: Capturing Non-rigid, 3D, Articulated Shape from Images , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Jitendra Malik,et al.  End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Peter V. Gehler,et al.  Neural Body Fitting: Unifying Deep Learning and Model Based Human Pose and Shape Estimation , 2018, 2018 International Conference on 3D Vision (3DV).

[35]  R. Vidal,et al.  3 D Pose Regression using Convolutional Neural Networks , .