ISLA: Temporal Segmentation and Labeling for Audio-Visual Emotion Recognition

Cited by: 27
Authors
Kim, Yelin [1 ]
Provost, Emily Mower [2 ]
Affiliations
[1] SUNY Albany, Dept Elect & Comp Engn, Albany, NY 12206 USA
[2] Univ Michigan, Dept Elect Engn & Comp Sci, Ann Arbor, MI 48109 USA
Keywords
Audio-visual; emotion; recognition; multimodal; temporal; face region; speech; facial expression; classification; modalities; movement; prosody; areas
DOI
10.1109/TAFFC.2017.2702653
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Emotion is an essential part of human interaction. Automatic emotion recognition can greatly benefit human-centered interactive technology, since extracted emotion can be used to understand and respond to user needs. However, real-world emotion recognition faces a central challenge when a user is speaking: facial movements due to speech are often confused with facial movements related to emotion. Recent studies have found that the use of phonetic information can reduce speech-related variability in the lower face region. However, methods to differentiate upper face movements due to emotion from those due to speech remain underexplored. This gap motivates our proposal of the Informed Segmentation and Labeling Approach (ISLA). ISLA uses the speech signal to account for how speaking alters the dynamics of the lower and upper face regions. We demonstrate how pitch can be used to improve emotion estimates from the upper face, and how these estimates can be combined with emotion estimates from the lower face and speech in a multimodal classification system. Our emotion classification results on the IEMOCAP and SAVEE datasets show that ISLA improves overall classification performance. We also demonstrate how emotion estimates from different modalities correlate with one another, providing insights into the differences between posed and spontaneous expressions.
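For intuition, here is a minimal Python sketch of the two ideas the abstract describes: pitch-informed segmentation of upper face features into voiced and unvoiced regions, and late fusion of per-modality emotion estimates. This is a sketch under stated assumptions, not the paper's implementation: the frame-level pitch track (with 0 Hz marking unvoiced frames), the segment-specific classifiers exposing a scikit-learn-style `predict_proba`, the names `voicing_segments`, `upper_face_estimate`, and `late_fusion`, and the weighted-average fusion rule are all illustrative choices.

```python
# Illustrative sketch only; names, thresholds, and the fusion rule are
# assumptions, not the implementation described in the paper.
import numpy as np

def voicing_segments(pitch_hz, min_frames=5):
    """Split frames into (start, end, is_voiced) runs using pitch presence
    (0 Hz is assumed to mark unvoiced frames)."""
    voiced = np.asarray(pitch_hz) > 0
    runs, start = [], 0
    for i in range(1, len(voiced) + 1):
        if i == len(voiced) or voiced[i] != voiced[start]:
            if i - start >= min_frames:  # drop very short runs
                runs.append((start, i, bool(voiced[start])))
            start = i
    return runs

def upper_face_estimate(face_feats, pitch_hz, clf_voiced, clf_unvoiced):
    """Score each segment with a voicing-specific classifier and average
    the class posteriors over the utterance."""
    probs = []
    for s, e, is_voiced in voicing_segments(pitch_hz):
        clf = clf_voiced if is_voiced else clf_unvoiced
        seg = face_feats[s:e].mean(axis=0, keepdims=True)  # one vector per segment
        probs.append(clf.predict_proba(seg)[0])
    return np.mean(probs, axis=0)

def late_fusion(posteriors, weights=None):
    """Weighted average of per-modality posteriors (upper face, lower face,
    speech); returns the winning class index and the fused distribution."""
    P = np.asarray(posteriors)  # shape: (n_modalities, n_classes)
    w = np.ones(len(P)) if weights is None else np.asarray(weights, dtype=float)
    fused = w @ P / w.sum()
    return int(np.argmax(fused)), fused
```

Treating voiced and unvoiced frames separately mirrors the abstract's observation that speaking alters facial dynamics; any segmentation cue (e.g., phonetic information for the lower face) could stand in for the pitch-presence rule used here.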
Pages: 196-208 (13 pages)
Related Papers (50 total)
  • [1] Temporal aggregation of audio-visual modalities for emotion recognition
    Birhala, Andreea; Ristea, Catalin Nicolae; Radoi, Anamaria; Dutu, Liviu Cristian
    2020 43rd International Conference on Telecommunications and Signal Processing (TSP), 2020: 305-308
  • [2] Multimodal and Temporal Perception of Audio-visual Cues for Emotion Recognition
    Ghaleb, Esam; Popa, Mirela; Asteriadis, Stylianos
    2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), 2019
  • [3] Audio-visual spontaneous emotion recognition
    Zeng, Zhihong; Hu, Yuxiao; Roisman, Glenn I.; Wen, Zhen; Fu, Yun; Huang, Thomas S.
    Artificial Intelligence for Human Computing, 2007, 4451: 72+
  • [4] Audio-Visual Learning for Multimodal Emotion Recognition
    Fan, Siyu; Jing, Jianan; Wang, Chongwen
    Symmetry-Basel, 2025, 17 (03)
  • [5] Audio-Visual Attention Networks for Emotion Recognition
    Lee, Jiyoung; Kim, Sunok; Kim, Seungryong; Sohn, Kwanghoon
    AVSU'18: Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia, 2018: 27-32
  • [6] Deep operational audio-visual emotion recognition
    Akturk, Kaan; Keceli, Ali Seydi
    Neurocomputing, 2024, 588
  • [7] Audio-Visual Emotion Recognition in Video Clips
    Noroozi, Fatemeh; Marjanovic, Marina; Njegus, Angelina; Escalera, Sergio; Anbarjafari, Gholamreza
    IEEE Transactions on Affective Computing, 2019, 10 (01): 60-75
  • [8] Audio-Visual Emotion Recognition with Boosted Coupled HMM
    Lu, Kun; Jia, Yunde
    2012 21st International Conference on Pattern Recognition (ICPR 2012), 2012: 1148-1151
  • [9] Audio-visual based emotion recognition: a new approach
    Song, ML; Bu, JJ; Chen, C; Li, N
    Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol 2, 2004: 1020-1025
  • [10] Audio-Visual Emotion Recognition Using Boltzmann Zippers
    Lu, Kun; Jia, Yunde
    2012 IEEE International Conference on Image Processing (ICIP 2012), 2012: 2589-2592