Unsupervised part representation by Flow Capsules

Capsule networks are designed to parse an image into a hierarchy of objects, parts, and relations. While promising, they remain limited by an inability to learn effective low-level part descriptions. To address this issue, we propose a novel self-supervised method for learning part descriptors of an image. During training, we exploit motion as a powerful perceptual cue for part definition, using an expressive decoder for part generation and layered image formation with occlusion. Experiments demonstrate robust part discovery in the presence of multiple objects, cluttered backgrounds, and significant occlusion. The resulting part descriptors, which we call part capsules, are inferred from single images and decoded into shape masks that fill in occluded pixels, along with relative depth. We also report unsupervised object classification using our capsule parts in a stacked capsule autoencoder.
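
As a rough illustration of what layered image formation with occlusion can look like, the sketch below composites per-part masks in depth order so that nearer parts hide farther ones. It is a generic layered-compositing example, not the method of this paper: the function name `compose_layers`, the convention that smaller depth means nearer, and the array shapes are all assumptions made for illustration.

```python
import numpy as np

def compose_layers(masks, depths, values):
    """Composite K layers front to back (illustrative sketch only).

    masks:  (K, H, W) soft occupancy in [0, 1], one mask per part
    depths: (K,) one scalar per part; smaller means nearer (assumed convention)
    values: (K, H, W, C) per-part content, e.g. a 2-channel flow field
    Returns the composite (H, W, C) and the per-part visibility (K, H, W).
    """
    masks = np.asarray(masks, dtype=float)
    values = np.asarray(values, dtype=float)
    order = np.argsort(depths)            # nearest part first
    free = np.ones(masks.shape[1:])       # fraction of each pixel still unoccluded
    visibility = np.zeros_like(masks)
    for k in order:
        visibility[k] = masks[k] * free   # visible portion of part k
        free = free * (1.0 - masks[k])    # part k occludes everything behind it
    composite = (visibility[..., None] * values).sum(axis=0)
    return composite, visibility

# Two parts overlapping at the middle pixel of a 1x3 image.
masks = np.zeros((2, 1, 3))
masks[0, 0, :2] = 1.0                     # part 0 covers pixels 0 and 1
masks[1, 0, 1:] = 1.0                     # part 1 covers pixels 1 and 2
depths = np.array([0.2, 0.8])             # part 0 is in front
values = np.zeros((2, 1, 3, 2))
values[0] = [1.0, 0.0]                    # part 0 moves right
values[1] = [0.0, 2.0]                    # part 1 moves down
flow, vis = compose_layers(masks, depths, values)
```

In this toy case the overlapping pixel takes the motion of the nearer part, while each mask itself still covers the occluded pixel; only the visibility term removes it from the composite.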
