Paper information - Toward real-time indoor semantic segmentation using depth information

Toward real-time indoor semantic segmentation using depth information

This work addresses multi-class segmentation of indoor scenes with RGB-D inputs. While this area of research has gained much attention recently, most works still rely on handcrafted features. In contrast, we apply a multiscale convolutional network to learn features directly from the images and the depth information. Using frame-by-frame labeling, we obtain nearly state-of-the-art performance on the NYU-v2 depth dataset with an accuracy of 64.5%. We then show that the labeling can be further improved by exploiting the temporal consistency in the video sequence of the scene. To that end, we present a method producing temporally consistent superpixels from a streaming video. Among the different methods producing superpixel segmentations of an image, the graph-based approach of Felzenszwalb and Huttenlocher is broadly employed. One of its interesting properties is that the regions are computed in a greedy manner in quasi-linear time by using a minimum spanning tree. In a framework exploiting minimum spanning trees all along, we propose an efficient video segmentation approach that computes temporally consistent superpixels in a causal manner, filling the need for causal and real-time applications. We illustrate the labeling of indoor scenes in video sequences that could be processed in real time using appropriate hardware such as an FPGA.
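The greedy, minimum-spanning-tree-based region merging that the abstract attributes to Felzenszwalb and Huttenlocher can be sketched as follows. This is a minimal illustration of the single-image merge criterion only, not the paper's causal video method and not its implementation. The function names `felzenszwalb_segments` and `chain_edges`, the scale parameter `k`, and the toy 1-D signal are assumptions made for this example.

```python
def felzenszwalb_segments(num_nodes, edges, k):
    """Greedy graph-based segmentation (Kruskal-style, with union-find).

    edges: iterable of (weight, u, v) tuples over nodes 0..num_nodes-1.
    k: scale parameter (hypothetical name); larger k favors larger regions.
    Returns a list giving one component label per node.
    """
    parent = list(range(num_nodes))
    size = [1] * num_nodes
    # Largest edge weight in each component's spanning tree (0 for singletons).
    internal = [0.0] * num_nodes

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Visiting edges by increasing weight is what makes the merge greedy.
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue
        # Merge only if the edge is no heavier than the internal variation
        # of both components, each relaxed by the size-dependent term k/|C|.
        if w <= min(internal[ru] + k / size[ru], internal[rv] + k / size[rv]):
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru
            size[ru] += size[rv]
            # Edges arrive in sorted order, so w is the new maximum.
            internal[ru] = w
    return [find(i) for i in range(num_nodes)]


def chain_edges(values):
    """Edges of a 1-D chain graph, weighted by absolute intensity difference."""
    return [(abs(values[i] - values[i + 1]), i, i + 1)
            for i in range(len(values) - 1)]


if __name__ == "__main__":
    signal = [0, 0, 0, 10, 10, 10]  # toy 1-D "image" with one sharp step
    labels = felzenszwalb_segments(len(signal), chain_edges(signal), k=1.0)
    print(labels)
```

Sorting the edges dominates the cost, which is where the quasi-linear running time comes from. On a real image, the nodes would be pixels and the edges would connect neighboring pixels, weighted by a color or depth difference.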

[1] César Cadena, et al. Semantic Parsing for Priming Object Detection in RGB-D Scenes, 2013.

[2] Sylvain Paris, et al. Edge-Preserving Smoothing and Mean-Shift Segmentation of Video Streams, 2008, ECCV.

[3] Kunihiko Fukushima, et al. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, 1980, Biological Cybernetics.

[4] Luca Maria Gambardella, et al. Flexible, High Performance Convolutional Neural Networks for Image Classification, 2011, IJCAI.

[5] T. Poggio, et al. Hierarchical models of object recognition in cortex, 1999, Nature Neuroscience.

[6] Sven Behnke, et al. Learning depth-sensitive conditional random fields for semantic segmentation of RGB-D images, 2014, 2014 IEEE International Conference on Robotics and Automation (ICRA).

[7] Luca Maria Gambardella, et al. Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images, 2012, NIPS.

[8] Fernand Meyer, et al. Graph-based object tracking, 2003, Proceedings 2003 International Conference on Image Processing (Cat. No.03CH37429).

[9] Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.

[10] Y. LeCun, et al. Learning methods for generic object recognition with invariance to pose and lighting, 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004.

[11] Navdeep Jaitly, et al. Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition, 2012, INTERSPEECH.

[12] Leo Grady, et al. A Seeded Image Segmentation Framework Unifying Graph Cuts And Random Walker Which Yields A New Algorithm, 2007, 2007 IEEE 11th International Conference on Computer Vision.

[13] Yann LeCun, et al. Indoor Semantic Segmentation using depth information, 2013, ICLR.

[14] Daniel P. Huttenlocher, et al. Efficient Graph-Based Image Segmentation, 2004, International Journal of Computer Vision.

[15] Myungcheol Lee, et al. Graph theory for image analysis: an approach based on the shortest spanning tree, 1986.

[16] Camille Couprie. Multi-label energy minimization for object class segmentation, 2012, 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO).

[17] Jung-Hwan Oh, et al. Clustering of Video Objects by Graph Matching, 2005, 2005 IEEE International Conference on Multimedia and Expo.

[18] Mei Han, et al. Efficient hierarchical graph-based video segmentation, 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[19] Yann LeCun, et al. Causal graph-based video segmentation, 2013, 2013 IEEE International Conference on Image Processing.

[20] Ronen Basri, et al. Contour-based joint clustering of multiple segmentations, 2011, CVPR 2011.

[21] Honglak Lee, et al. Deep learning for detecting robotic grasps, 2013, Int. J. Robotics Res.

[23] Alexei A. Efros, et al. Geometric context from a single image, 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[24] Fernand Meyer, et al. Minimum Spanning Forests for Morphological Segmentation, 1994, ISMM.

[25] Chenliang Xu, et al. Streaming Hierarchical Video Segmentation, 2012, ECCV.

[26] Nitish Srivastava, et al. Improving neural networks by preventing co-adaptation of feature detectors, 2012, ArXiv.

[27] Gilles Bertrand, et al. Watershed Cuts: Minimum Spanning Forests and the Drop of Water Principle, 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28] Yann LeCun, et al. Scene parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers, 2012, ICML.

[29] Luca Maria Gambardella, et al. Mitosis Detection in Breast Cancer Histology Images with Deep Neural Networks, 2013, MICCAI.

[30] Jean Ponce, et al. Multi-class cosegmentation, 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[31] Sven Behnke, et al. Learning Object-Class Segmentation with Convolutional Neural Networks, 2012, ESANN.

[32] Jörg Stückler, et al. Dense real-time mapping of object-class semantics from RGB-D video, 2013, Journal of Real-Time Image Processing.

[33] Camille Couprie, et al. Learning Hierarchical Features for Scene Labeling, 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34] Michel Couprie, et al. Some links between extremum spanning forests, watersheds and min-cuts, 2010, Image Vis. Comput.

[35] Jitendra Malik, et al. Normalized cuts and image segmentation, 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[36] Derek Hoiem, et al. Indoor Segmentation and Support Inference from RGBD Images, 2012, ECCV.

[37] Nathan Silberman, et al. Indoor scene segmentation using a structured light sensor, 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[38] Jonathan T. Barron, et al. A category-level 3-D object dataset: Putting the Kinect to work, 2011, ICCV Workshops.

[39] Camille Couprie, et al. Power Watershed: A Unifying Graph-Based Optimization Framework, 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40] Andrew Y. Ng, et al. Convolutional-Recursive Deep Learning for 3D Object Classification, 2012, NIPS.

[41] M. Hebert, et al. Efficient temporal consistency for streaming video scene analysis, 2013, 2013 IEEE International Conference on Robotics and Automation.

[42] Dieter Fox, et al. RGB-(D) scene labeling: Features and algorithms, 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[43] Paria Mehrani, et al. Superpixels and Supervoxels in an Energy Optimization Framework, 2010, ECCV.

[44] Luiz Velho, et al. Kinect and RGBD Images: Challenges and Applications, 2012, 2012 25th SIBGRAPI Conference on Graphics, Patterns and Images Tutorials.

[45] Clément Farabet, et al. Torch7: A Matlab-like Environment for Machine Learning, 2011, NIPS 2011.

[46] Dorin Comaniciu, et al. Mean Shift: A Robust Approach Toward Feature Space Analysis, 2002, IEEE Trans. Pattern Anal. Mach. Intell.

[47] Dani Lischinski, et al. Colorization using optimization, 2004, ACM Trans. Graph.

[48] Clément Farabet, et al. Towards real-time image understanding with convolutional networks, 2013.