Video Pixel Networks

We propose a probabilistic video model, the Video Pixel Network (VPN), that estimates the discrete joint distribution of the raw pixel values in a video. The model and the neural architecture reflect the time, space and color structure of video tensors and encode it as a four-dimensional dependency chain. The VPN approaches the best possible performance on the Moving MNIST benchmark, a leap over the previous state of the art, and the generated videos show only minor deviations from the ground truth. The VPN also produces detailed samples on the action-conditional Robotic Pushing benchmark and generalizes to the motion of novel objects.

[1]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[2]  Phil Blunsom,et al.  Recurrent Continuous Translation Models , 2013, EMNLP.

[3]  Marc'Aurelio Ranzato,et al.  Video (language) modeling: a baseline for generative models of natural videos , 2014, ArXiv.

[4]  Iasonas Kokkinos,et al.  Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs , 2014, ICLR.

[5]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[6]  Viorica Patraucean,et al.  Spatio-temporal video autoencoder with differentiable memory , 2015, ArXiv.

[7]  Jürgen Schmidhuber,et al.  Highway Networks , 2015, ArXiv.

[8]  Honglak Lee,et al.  Action-Conditional Video Prediction using Deep Networks in Atari Games , 2015, NIPS.

[9]  Dit-Yan Yeung,et al.  Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting , 2015, NIPS.

[10]  Alex Graves,et al.  Conditional Image Generation with PixelCNN Decoders , 2016, NIPS.

[11]  Yann LeCun,et al.  Deep multi-scale video prediction beyond mean square error , 2015, ICLR.

[12]  Koray Kavukcuoglu,et al.  Pixel Recurrent Neural Networks , 2016, ICML.

[13]  Alex Graves,et al.  Grid Long Short-Term Memory , 2015, ICLR.

[14]  Jian Sun,et al.  Identity Mappings in Deep Residual Networks , 2016, ECCV.

[15]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[16]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[17]  Antonio Torralba,et al.  Generating Videos with Scene Dynamics , 2016, NIPS.

[18]  Sergey Levine,et al.  Unsupervised Learning for Physical Interaction through Video Prediction , 2016, NIPS.