Combination of sequential class distributions from multiple channels using Markov fusion networks

The recognition of patterns in real-time scenarios has become an important trend in the field of multi-modal user interfaces in human–computer interaction. Cognitive technical systems aim to improve human–computer interaction by recognizing the situational context, e.g. through activity recognition (Ahad et al. in IEEE, 1896–1901, 2008), or by estimating the affective state (Zeng et al., IEEE Trans Pattern Anal Mach Intell 31(1):39–58, 2009) of the human dialogue partner. Classifier systems developed for such applications must operate on multiple modalities and must integrate the available decisions over long time periods. We address this topic by introducing the Markov fusion network (MFN), a novel classifier combination approach for the integration of multi-class and multi-modal decisions continuously over time. The MFN combines results while meeting real-time requirements, weights the decisions of the individual modalities dynamically, and copes with sensor failures. The proposed MFN has been evaluated in two empirical studies, the recognition of objects involved in human activities and the recognition of emotions, in both of which it demonstrates outstanding performance. Furthermore, we show how the MFN can be applied in a variety of architectures and describe the options available for configuring the model to meet the demands of a given problem.
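The dynamic weighting of modalities and the handling of sensor failures described in the abstract can be illustrated with a small sketch. The snippet below is an illustrative simplification, not the authors' exact model: it fuses time series of class distributions from several modalities by iteratively averaging each time step's available, weighted classifier outputs with its temporal neighbours, which captures the intuition of a Markov-chain-structured fusion; the function and parameter names are assumptions.

```python
# Illustrative sketch of a Markov-fusion-network-style combination
# (simplified; names and the update rule are assumptions, not the
# authors' exact formulation).

def fuse(obs, weights, smooth=1.0, iters=200):
    """Fuse per-modality class distributions over time.

    obs:     dict modality -> {t: list of K class probabilities};
             gaps in t model sensor failures.
    weights: dict modality -> non-negative influence of that modality.
    Returns a list y of T fused class distributions.
    """
    T = 1 + max(t for series in obs.values() for t in series)
    K = len(next(iter(next(iter(obs.values())).values())))
    y = [[1.0 / K] * K for _ in range(T)]  # start from uniform
    for _ in range(iters):
        for t in range(T):
            num, den = [0.0] * K, 0.0
            # Data terms: pull y[t] towards each available decision,
            # scaled by that modality's weight.
            for m, series in obs.items():
                if t in series:
                    for k in range(K):
                        num[k] += weights[m] * series[t][k]
                    den += weights[m]
            # Smoothness terms: pull y[t] towards its temporal
            # neighbours, bridging steps where sensors failed.
            for tn in (t - 1, t + 1):
                if 0 <= tn < T:
                    for k in range(K):
                        num[k] += smooth * y[tn][k]
                    den += smooth
            y[t] = [v / den for v in num]  # convex combination: sums to 1
    return y
```

With, say, an audio stream missing at one time step, the fused distribution at that step is interpolated from the remaining modality and the neighbouring estimates; increasing `smooth` biases the output towards temporal consistency.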

[1] Günther Palm et al. On the discovery of events in EEG data utilizing information fusion, 2013, Comput. Stat.

[2] J. Allwood. A Framework for Studying Human Multimodal Communication, 2013.

[3] Vladimir Pavlovic et al. Face tracking and recognition with visual constraints in real-world videos, 2008, IEEE Conference on Computer Vision and Pattern Recognition.

[4] Michael Beetz et al. CoTeSys—Cognition for Technical Systems, 2010, KI - Künstliche Intelligenz.

[5] Friedhelm Schwenker et al. Conditioned Hidden Markov Model Fusion for Multimodal Classification, 2011, INTERSPEECH.

[6] Ludmila I. Kuncheva et al. Combining Pattern Classifiers: Methods and Algorithms, 2004.

[7] Eric Horvitz et al. Layered representations for learning and inferring office activity from multiple sensory channels, 2004, Comput. Vis. Image Underst.

[8] Nello Cristianini et al. An Introduction to Support Vector Machines, 2000.

[9] Yong Gu Ji. Inferring prosody from facial cues for EMG-based synthesis of silent speech, 2012.

[10] Günther Palm et al. Multiple classifier combination using reject options and Markov fusion networks, 2012, ICMI '12.

[11] Günther Palm et al. Dempster-Shafer Fusion of Context Sources for Pedestrian Recognition, 2012, Belief Functions.

[13] P. Ekman. An argument for basic emotions, 1992.

[14] Friedhelm Schwenker et al. Kalman Filter Based Classifier Fusion for Affective State Recognition, 2013, MCS.

[15] Barbara Caputo et al. Recognizing human actions: a local SVM approach, 2004, Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004).

[16] Xiaojin Zhu et al. Semi-Supervised Learning Literature Survey, 2005.

[17] Günther Palm et al. Detecting Actions by Integrating Sequential Symbolic and Sub-symbolic Information in Human Activity Recognition, 2012, MLDM.

[18] E. Anna et al. Cognitive Behavioural Systems, 2012.

[19] Subhash C. Bagui et al. Combining Pattern Classifiers: Methods and Algorithms, 2005, Technometrics.

[20] Maja Pantic et al. The SEMAINE corpus of emotionally coloured character interactions, 2010, IEEE International Conference on Multimedia and Expo.

[21] Markus Kächele et al. Using unlabeled data to improve classification of emotional states in human computer interaction, 2013, Journal on Multimodal User Interfaces.

[22] Leo Breiman et al. Bagging Predictors, 1996, Machine Learning.

[23] Friedhelm Schwenker et al. Fusion of Fragmentary Classifier Decisions for Affective State Recognition, 2012, MPRSS.

[24] Günther Palm et al. A generic framework for the inference of user states in human computer interaction, 2012, Journal on Multimodal User Interfaces.

[25] Sebastian Thrun et al. An Application of Markov Random Fields to Range Sensing, 2005, NIPS.

[26] William T. Freeman et al. Orientation Histograms for Hand Gesture Recognition, 1995.

[27] Wolfgang Wahlster et al. SmartKom: Symmetric Multimodality in an Adaptive and Reusable Dialogue Shell, 2003.

[28] Zhihong Zeng et al. A Survey of Affect Recognition Methods: Audio, Visual, and Spontaneous Expressions, 2009, IEEE Trans. Pattern Anal. Mach. Intell.

[29] G. Palm et al. Learning of Decision Fusion Mappings for Pattern Recognition, 2006.

[30] Roddy Cowie et al. Emotional speech: Towards a new generation of databases, 2003, Speech Commun.

[31] Markus Kächele et al. Classification of Emotional States in a WOZ Scenario Exploiting Labeled and Unlabeled Bio-physiological Data, 2011, PSL.

[32] Friedhelm Schwenker et al. Classification of bioacoustic time series based on the combination of global and local decisions, 2004, Pattern Recognit.

[33] Radford M. Neal. Pattern Recognition and Machine Learning, 2007, Technometrics.

[34] Dustin Boswell et al. Introduction to Support Vector Machines, 2002.

[36] Ana Paiva et al. Affect recognition for interactive companions: challenges and design in real world scenarios, 2009, Journal on Multimodal User Interfaces.

[37] Friedhelm Schwenker et al. Incorporating uncertainty in a layered HMM architecture for human activity recognition, 2011, J-HGBU '11.

[38] Nadia Bianchi-Berthouze et al. Naturalistic Affective Expression Classification by a Multi-stage Approach Based on Hidden Markov Models, 2011, ACII.

[39] Leo Breiman et al. Random Forests, 2001, Machine Learning.

[40] K. Scherer et al. The World of Emotions is not Two-Dimensional, 2007, Psychological Science.

[41] Andreas Wendemuth et al. Companion-Technology for Cognitive Technical Systems, 2011, KI - Künstliche Intelligenz.

[42] Michael J. Swain et al. Color indexing, 1991, International Journal of Computer Vision.

[43] Alex Acero et al. Spoken Language Processing: A Guide to Theory, Algorithm and System Development, 2001.

[44] Björn W. Schuller et al. AVEC 2011 - The First International Audio/Visual Emotion Challenge, 2011, ACII.

[45] Louis-Philippe Morency et al. Modeling Latent Discriminative Dynamic of Multi-dimensional Affective Signals, 2011, ACII.

[46] S. Ishikawa et al. Human activity recognition: Various paradigms, 2008, International Conference on Control, Automation and Systems.

[47] Alex Pentland et al. Coupled hidden Markov models for complex action recognition, 1997, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[48] Friedhelm Schwenker et al. Partially Supervised Learning, 2011, Lecture Notes in Computer Science.

[49] Stephen E. Levinson et al. A fused hidden Markov model with application to bimodal speech processing, 2004, IEEE Transactions on Signal Processing.

[50] Alex Pentland et al. Social signal processing: state-of-the-art and future perspectives of an emerging domain, 2008, ACM Multimedia.

[51] Björn W. Schuller et al. Towards More Reality in the Recognition of Emotional Speech, 2007, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07).

[52] Nir Friedman et al. Probabilistic Graphical Models - Principles and Techniques, 2009.

[53] Günther Palm et al. Towards Emotion Recognition in Human Computer Interaction, 2012, WIRN.

[54] Christian R. Dietrich et al. Temporal sensor fusion for the classification of bioacoustic time series, 2004.

[55] Jiri Matas et al. On Combining Classifiers, 1998, IEEE Trans. Pattern Anal. Mach. Intell.

[56] Günther Palm et al. Hidden Markov models with graph densities for action recognition, 2013, International Joint Conference on Neural Networks (IJCNN).

[57] Mário A. T. Figueiredo et al. Similarity-Based Clustering of Sequences Using Hidden Markov Models, 2003, MLDM.

[58] Sascha Meudt et al. Multi-Modal Classifier-Fusion for the Recognition of Emotions, 2013.

[59] Gwen Littlewort et al. The computer expression recognition toolbox (CERT), 2011, Face and Gesture 2011.

[60] Christian Thiel et al. Multiple Classifier Systems Incorporating Uncertainty, 2011.

[61] Björn W. Schuller et al. Frame vs. Turn-Level: Emotion Recognition from Speech Considering Static and Dynamic Processing, 2007, ACII.

[62] Björn W. Schuller et al. Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional LSTM modeling, 2010, INTERSPEECH.

[63] Markus Kächele et al. Multiple Classifier Systems for the Classification of Audio-Visual Emotional States, 2011, ACII.