Emotion Recognition from Speech

Spoken language is one of the principal modes of interaction in human-human communication as well as in natural, companion-like human-machine interaction. Speech conveys not only content but also emotions and interaction patterns that determine the nature and quality of the user's relationship to their counterpart. We therefore consider emotion recognition from speech in the wider sense of its application in Companion-systems. This requires a dedicated annotation process to label emotions and to describe their temporal evolution, so that a system's reaction can be properly regulated and controlled. The problem is particularly challenging in naturalistic interactions, where emotional labels are no longer given a priori. This calls for generating and measuring a reliable ground truth, and that measurement is closely tied to the choice of appropriate emotional features and classification techniques. Furthermore, acted and naturalistic spoken data must be available in operational form (corpora) for the development of emotion classification; we address the difficulties arising from the variety of these data sources. Speaker clustering and speaker adaptation also improve the emotional modeling. Finally, combining the acoustic affective evaluation with the interpretation of non-verbal interaction patterns leads to a better understanding of, and reaction to, user-specific emotional behavior.
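To make the processing chain concrete, the following is a minimal sketch (not the system described in this chapter) of utterance-level acoustic emotion classification: frame-level descriptors such as MFCCs, pitch and energy are summarized by per-utterance statistics and passed to a standard classifier. The library choices (librosa, scikit-learn), the per-utterance WAV file layout and the categorical label format are illustrative assumptions, not part of the original text.

```python
# Minimal sketch of utterance-level acoustic emotion classification.
# Assumptions (not prescribed by this chapter): per-utterance WAV files,
# categorical emotion labels, librosa for feature extraction and
# scikit-learn for classification.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def utterance_features(wav_path, sr=16000):
    """Map a variable-length utterance to a fixed-length feature vector."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # spectral shape
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)            # pitch contour
    energy = librosa.feature.rms(y=y)                        # intensity
    # Per-utterance statistics (means and standard deviations) are one common
    # way to turn frame-level descriptors into a fixed-size vector.
    parts = [mfcc.mean(axis=1), mfcc.std(axis=1),
             [f0.mean(), f0.std()], [energy.mean(), energy.std()]]
    return np.concatenate([np.atleast_1d(p) for p in parts])

def train_and_evaluate(wav_paths, labels):
    """Train an SVM on utterance-level features and report cross-validated accuracy.
    Note: speaker-independent evaluation would require folds grouped by speaker;
    plain cross-validation is shown here only for brevity."""
    X = np.vstack([utterance_features(p) for p in wav_paths])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    return cross_val_score(clf, X, labels, cv=5).mean()
```

In a naturalistic setting, the labels fed to such a pipeline would themselves come from the annotation and inter-rater agreement process discussed above, rather than from predefined acted categories.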
