ISLA: Temporal Segmentation and Labeling for Audio-Visual Emotion Recognition

Citations: 27
Authors
Kim, Yelin [1 ]
Provost, Emily Mower [2 ]
Affiliations
[1] SUNY Albany, Dept Elect & Comp Engn, Albany, NY 12206 USA
[2] Univ Michigan, Dept Elect Engn & Comp Sci, Ann Arbor, MI 48109 USA
Keywords
Audio-visual; emotion; recognition; multimodal; temporal; face region; speech; FACIAL EXPRESSION; SPEECH; CLASSIFICATION; MODALITIES; MOVEMENT; PROSODY; AREAS;
DOI
10.1109/TAFFC.2017.2702653
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Emotion is an essential part of human interaction. Automatic emotion recognition can greatly benefit human-centered interactive technology, since extracted emotion can be used to understand and respond to user needs. However, real-world emotion recognition faces a central challenge when a user is speaking: facial movements due to speech are often confused with facial movements related to emotion. Recent studies have found that the use of phonetic information can reduce speech-related variability in the lower face region. However, methods to differentiate upper face movements due to emotion and due to speech have been underexplored. This gap leads us to the proposal of the Informed Segmentation and Labeling Approach (ISLA). ISLA uses speech signals, which alter the dynamics of the lower and upper face regions, to inform segmentation and labeling. We demonstrate how pitch can be used to improve estimates of emotion from the upper face, and how this estimate can be combined with emotion estimates from the lower face and speech in a multimodal classification system. Our emotion classification results on the IEMOCAP and SAVEE datasets show that ISLA improves overall classification performance. We also demonstrate how emotion estimates from different modalities correlate with each other, providing insights into the differences between posed and spontaneous expressions.
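The abstract describes combining per-modality emotion estimates (upper face, lower face, speech) into a single multimodal decision. A minimal sketch of that late-fusion idea is shown below; the modality names, class labels, probability values, and equal-weight averaging are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Hypothetical emotion classes; the paper's actual label set may differ.
CLASSES = ["angry", "happy", "neutral", "sad"]

def fuse_estimates(estimates, weights=None):
    """Combine per-modality class-probability vectors by weighted averaging.

    estimates: dict mapping modality name -> list of class probabilities.
    weights:   optional per-modality weights; defaults to equal weighting.
    """
    probs = np.array(list(estimates.values()), dtype=float)  # shape (M, C)
    if weights is None:
        weights = np.ones(len(probs)) / len(probs)
    fused = np.average(probs, axis=0, weights=weights)
    return fused / fused.sum()  # renormalize to a valid distribution

# Illustrative per-modality estimates (made-up numbers).
estimates = {
    "speech":     [0.10, 0.20, 0.30, 0.40],
    "lower_face": [0.05, 0.15, 0.40, 0.40],
    "upper_face": [0.10, 0.10, 0.30, 0.50],  # e.g., a pitch-informed estimate
}
fused = fuse_estimates(estimates)
print(CLASSES[int(np.argmax(fused))])  # prints "sad"
```

In practice the per-modality weights would be tuned on validation data rather than fixed to be equal, and the upper-face estimate would come from a classifier conditioned on pitch-based segmentation as the abstract describes.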
Pages: 196 - 208
Page count: 13
Related Papers
50 records in total
  • [41] Audio-visual gender recognition
    Liu, Ming
    Xu, Xun
    Huang, Thomas S.
    MIPPR 2007: PATTERN RECOGNITION AND COMPUTER VISION, 2007, 6788
  • [42] An audio-visual speech recognition system for testing new audio-visual databases
    Pao, Tsang-Long
    Liao, Wen-Yuan
    VISAPP 2006: PROCEEDINGS OF THE FIRST INTERNATIONAL CONFERENCE ON COMPUTER VISION THEORY AND APPLICATIONS, VOL 2, 2006, : 192 - +
  • [43] LEARNING CONTEXTUALLY FUSED AUDIO-VISUAL REPRESENTATIONS FOR AUDIO-VISUAL SPEECH RECOGNITION
    Zhang, Zi-Qiang
    Zhang, Jie
    Zhang, Jian-Shu
    Wu, Ming-Hui
    Fang, Xin
    Dai, Li-Rong
    2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 1346 - 1350
  • [44] AVSegFormer: Audio-Visual Segmentation with Transformer
    Gao, Shengyi
    Chen, Zhe
    Chen, Guo
    Wang, Wenhai
    Lu, Tong
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 11, 2024, : 12155 - 12163
  • [45] Unsupervised Audio-Visual Lecture Segmentation
    Singh, S. Darshan
    Gupta, Anchit
    Jawahar, C. V.
    Tapaswi, Makarand
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 5221 - 5230
  • [46] Audio-Visual Emotion Recognition using Gaussian Mixture Models for Face and Voice
    Metallinou, Angeliki
    Lee, Sungbok
    Narayanan, Shrikanth
    ISM: 2008 IEEE INTERNATIONAL SYMPOSIUM ON MULTIMEDIA, 2008, : 250 - 257
  • [47] Improved ConvMixer and Focal Loss with Dynamic Weight for Audio-Visual Emotion Recognition
    Shi, Shuo
    Qin, Jia-Jun
    Yu, Yang
    Hao, Xiao-Ke
Tien Tzu Hsueh Pao/Acta Electronica Sinica, 52 (08): 2824 - 2835
  • [48] Joint modelling of audio-visual cues using attention mechanisms for emotion recognition
    Ghaleb, Esam
    Niehues, Jan
    Asteriadis, Stylianos
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (08) : 11239 - 11264
  • [49] Learning Affective Features With a Hybrid Deep Model for Audio-Visual Emotion Recognition
    Zhang, Shiqing
    Zhang, Shiliang
    Huang, Tiejun
    Gao, Wen
    Tian, Qi
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2018, 28 (10) : 3030 - 3043