Grounding Visual Concepts for Zero-Shot Event Detection and Event Captioning

The flourishing of social media platforms requires techniques for understanding media content at large scale. However, state-of-the-art video event understanding approaches remain limited in their ability to cope with data sparsity, semantically unrepresentative event names, and the lack of coherence between visual and textual concepts. Accordingly, in this paper, we propose a method of grounding visual concepts for large-scale Multimedia Event Detection (MED) and Multimedia Event Captioning (MEC) in a zero-shot setting. More specifically, our framework comprises the following: (1) deriving novel semantic representations of events from their textual descriptions rather than their event names; (2) aggregating the ranks of grounded concepts for MED tasks, with a statistical mean-shift outlier rejection model proposed to remove incorrectly grounded outlying concepts; and (3) defining the MEC task and augmenting the MEC training set with the videos detected by MED in the zero-shot setting. To the best of our knowledge, this work is the first to define and solve the MEC task, which is a further step towards understanding video events. We conduct extensive experiments and achieve state-of-the-art performance on the TRECVID MEDTest dataset, as well as on our newly proposed TRECVID-MEC dataset.
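
To make steps (1) and (2) concrete, the sketch below illustrates, in Python, how description-based concept grounding and relevance-weighted concept aggregation could look. It is a minimal sketch, not the paper's implementation: a plain z-score test stands in for the proposed statistical mean-shift outlier rejection, and all identifiers (word_vecs, concept_bank, ground_concepts, score_video) are hypothetical.

```python
# Minimal sketch (not the authors' implementation) of zero-shot event detection
# via grounded concepts: an event is represented by the word vectors of its
# textual description, candidate concepts are ranked by cosine similarity,
# outlying concepts are rejected, and a video is scored by a weighted sum of
# the retained concepts' detector scores. All names here are hypothetical.
import numpy as np

def embed(text, word_vecs):
    """Average the word vectors of the tokens that appear in the vocabulary."""
    vecs = [word_vecs[w] for w in text.lower().split() if w in word_vecs]
    dim = next(iter(word_vecs.values())).shape
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def ground_concepts(event_description, concept_bank, word_vecs, z_thresh=1.5):
    """Rank concepts by similarity to the event description and drop outliers.
    A simple z-score cutoff stands in for the paper's statistical mean-shift
    outlier rejection model."""
    event_vec = embed(event_description, word_vecs)
    sims = np.array([cosine(event_vec, embed(c, word_vecs)) for c in concept_bank])
    keep = np.abs(sims - sims.mean()) <= z_thresh * (sims.std() + 1e-8)
    order = np.argsort(-sims)
    return [(concept_bank[i], float(sims[i])) for i in order if keep[i]]

def score_video(video_concept_scores, grounded):
    """Aggregate per-video concept detector scores, weighted by relevance."""
    return sum(rel * video_concept_scores.get(name, 0.0) for name, rel in grounded)

# Toy usage with random word vectors; in practice these would be pretrained
# embeddings (e.g., word2vec) and per-video concept detector outputs.
rng = np.random.default_rng(0)
vocab = ["dog", "show", "grooming", "leash", "judge", "car", "engine"]
word_vecs = {w: rng.normal(size=50) for w in vocab}
concept_bank = ["dog", "leash", "judge", "car", "engine"]
grounded = ground_concepts("dog show with judge and grooming", concept_bank, word_vecs)
video = {"dog": 0.9, "judge": 0.7, "car": 0.1}
print(grounded)
print(score_video(video, grounded))
```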
