Tracking and Planning with Spatial World Models

We introduce a method for real-time navigation and tracking with differentiably rendered world models. Learned models for control have led to impressive results in robotics and computer games, but this success has yet to be extended to vision-based navigation. To address this, we transfer advances in the emerging field of differentiable rendering to model-based control. We do this by planning in a learned 3D spatial world model, combined with a pose estimation algorithm previously used in the context of TSDF fusion, which we tailor to our setting and improve to incorporate agent dynamics. We evaluate on six simulated environments based on complex, human-designed floor plans and provide quantitative results. We achieve a navigation success rate of up to 92% at a control frequency of 15 Hz, using only image and depth observations under stochastic, continuous dynamics.
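The tracking component can be made concrete with a small sketch. Below is a minimal, hypothetical illustration of pose estimation against a differentiable world model with a dynamics prior: the 6-DoF pose is optimized by gradient descent to minimize rendered-versus-observed depth error, regularized toward the pose predicted by the agent's motion model. The `world_model` interface (`render_depth`, `predict_pose`), the pose parameterization, and all hyperparameters are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of differentiable-rendering-based pose tracking
# with a dynamics prior. The world_model methods used here
# (predict_pose, render_depth) are assumed interfaces, not the
# paper's API.
import torch

def track_pose(world_model, obs_depth, prev_pose, action,
               n_steps=20, lr=1e-2, dyn_weight=0.1):
    """Estimate the current pose (a 6-vector: translation + axis-angle
    rotation) by minimizing rendered-vs-observed depth error plus a
    penalty for deviating from the dynamics prediction."""
    # The motion model supplies both the initial guess and the anchor
    # for the dynamics prior term.
    pose_prior = world_model.predict_pose(prev_pose, action).detach()
    pose = pose_prior.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)

    for _ in range(n_steps):
        opt.zero_grad()
        # Differentiable rendering: the depth image is a differentiable
        # function of the candidate pose.
        rendered = world_model.render_depth(pose)
        photometric = torch.nn.functional.mse_loss(rendered, obs_depth)
        # Dynamics prior: stay close to the pose the motion model predicts.
        dynamics = dyn_weight * (pose - pose_prior).pow(2).sum()
        loss = photometric + dynamics
        loss.backward()
        opt.step()

    return pose.detach()
```

The weight on the dynamics term trades off trust in the motion model against trust in the rendered observation; this prior is what distinguishes the setting from plain frame-to-model alignment as used in TSDF fusion.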
