Spatio-temporal Predictive Network For Videos With Physical Properties

In this paper, we propose a spatio-temporal predictive network that uses attention to weight multiple physical deep learning (DL) models for videos with various physical properties. Previous approaches use models with multiple branches for different properties in videos, but they simply sum the branch outputs, even when the properties change over time and space. In addition, previous models are difficult to train to learn sufficient representations of the physical properties in videos. We therefore propose a design for the spatio-temporal prediction network and a training method for videos with multiple physical properties, both motivated by the Mixture of Experts framework. The design consists of multiple spatio-temporal DL branches (experts) for multiple physical properties and a pixel-wise and expert-wise attention mechanism, the Spatial-Temporal Gating Networks (STGNs), that adaptively integrates the outputs of the experts. The experts are trained on a vast amount of synthetic image sequences generated from physical equations and noise models. The whole network, including the STGNs, can then be trained with only a limited number of real datasets. Experiments on various videos, i.e., traffic videos, pedestrian videos, dynamic texture videos, and radar images, show that the proposed approach outperforms previous approaches.
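The difference between simply summing branch outputs and integrating them with pixel-wise, expert-wise attention can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names `softmax` and `mix_experts`, the tensor layout (experts, time, height, width), and the use of a softmax over the expert axis are all assumptions made for illustration, and random arrays stand in for both the expert predictions and the gating scores that STGNs would produce.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def mix_experts(expert_outputs, gate_logits):
    """Blend expert predictions with per-pixel, per-expert weights.

    expert_outputs: array of shape (E, T, H, W), one prediction per expert.
    gate_logits:    array of shape (E, T, H, W), unnormalized gating scores
                    (hypothetical stand-in for the output of a gating network).
    Returns the blended prediction of shape (T, H, W) and the weights.
    """
    # Assumption: weights are normalized across experts at every (t, y, x).
    weights = softmax(gate_logits, axis=0)
    mixed = (weights * expert_outputs).sum(axis=0)
    return mixed, weights

rng = np.random.default_rng(0)
E, T, H, W = 3, 4, 8, 8  # experts, frames, height, width (illustrative sizes)
expert_outputs = rng.normal(size=(E, T, H, W))
gate_logits = rng.normal(size=(E, T, H, W))

# Attention-weighted integration: each pixel and frame has its own expert mix.
mixed, weights = mix_experts(expert_outputs, gate_logits)

# Baseline described in the abstract: branch outputs are simply summed.
naive_sum = expert_outputs.sum(axis=0)
```

With all-zero gating scores the blend reduces to the plain average of the experts, and as one expert's score dominates at a location, the blend there approaches that expert's prediction alone. That is the sense in which such weighting can adapt to properties that change over time and space.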
