Violence Detection in Movies with Auditory and Visual Cues

In this paper, we present a novel method to detect violent scenes in movies. The detection process involves two views—audio and video views. For the audio view, a supervised method based on HCRFs is exploited to improve the classification performance. For the video view, we detect violent shots by locating the violent areas. Finally, the auditory and visual classifiers are combined in a co-training way. The experimental results on several movies with violent contents preliminarily show the effectiveness of our method.

[1]  Cordelia Schmid,et al.  Scale & Affine Invariant Interest Point Detectors , 2004, International Journal of Computer Vision.

[2]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[3]  Lie Lu,et al.  A flexible framework for key audio effects detection and auditory context inference , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[4]  Nuno Vasconcelos,et al.  Towards semantically meaningful feature spaces for the characterization of video content , 1997, Proceedings of International Conference on Image Processing.

[5]  Zhu Liu,et al.  Multimedia content analysis-using both audio and visual clues , 2000, IEEE Signal Process. Mag..

[6]  Claire Cardie,et al.  Weakly Supervised Natural Language Learning Without Redundant Views , 2003, NAACL.

[7]  Jiri Matas,et al.  Robust wide-baseline stereo from maximally stable extremal regions , 2004, Image Vis. Comput..

[8]  Avrim Blum,et al.  The Bottleneck , 2021, Monopsony Capitalism.

[9]  Anoop Sarkar,et al.  Applying Co-Training Methods to Statistical Parsing , 2001, NAACL.

[10]  Wen Gao,et al.  Effective scene matching with local feature representatives , 2008, 2008 19th International Conference on Pattern Recognition.

[11]  Raimondo Schettini,et al.  Content-Based Classification of Digital Photos , 2002, Multiple Classifier Systems.

[12]  Jeho Nam,et al.  Audio-visual content-based violent scene characterization , 1998, Proceedings 1998 International Conference on Image Processing. ICIP98 (Cat. No.98CB36269).

[13]  Douglas Keislar,et al.  Content-Based Classification, Search, and Retrieval of Audio , 1996, IEEE Multim..

[14]  Trevor Darrell,et al.  Hidden Conditional Random Fields , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Ahmed H. Tewfik,et al.  Data hiding for video-in-video , 1997, Proceedings of International Conference on Image Processing.

[16]  Leonidas J. Guibas,et al.  The Earth Mover's Distance as a Metric for Image Retrieval , 2000, International Journal of Computer Vision.

[17]  James R. Curran,et al.  Bootstrapping POS-taggers using unlabelled data , 2003, CoNLL.

[18]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[19]  R. El-Khoribi Hidden Conditional Random Fields for ECG Classification , 2008 .

[20]  Alex Acero,et al.  Hidden conditional random fields for phone classification , 2005, INTERSPEECH.

[21]  Wen-Huang Cheng,et al.  Semantic context detection based on hierarchical audio models , 2003, MIR '03.

[22]  Tracie Pasold,et al.  Violence exposure in real-life, video games, television, movies, and the internet: is there desensitization? , 2004, Journal of adolescence.

[23]  Mubarak Shah,et al.  Person-on-person violence detection in video data , 2002, Object recognition supported by user interaction for service robots.