ISLA: Temporal Segmentation and Labeling for Audio-Visual Emotion Recognition

Cited by: 27
Authors
Kim, Yelin [1 ]
Provost, Emily Mower [2 ]
Affiliations
[1] SUNY Albany, Dept Elect & Comp Engn, Albany, NY 12206 USA
[2] Univ Michigan, Dept Elect Engn & Comp Sci, Ann Arbor, MI 48109 USA
Keywords
Audio-visual; emotion; recognition; multimodal; temporal; face region; speech; FACIAL EXPRESSION; SPEECH; CLASSIFICATION; MODALITIES; MOVEMENT; PROSODY; AREAS
DOI
10.1109/TAFFC.2017.2702653
CLC Number
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Emotion is an essential part of human interaction. Automatic emotion recognition can greatly benefit human-centered interactive technology, since the extracted emotion can be used to understand and respond to user needs. However, real-world emotion recognition faces a central challenge when a user is speaking: facial movements due to speech are often confused with facial movements related to emotion. Recent studies have found that phonetic information can reduce speech-related variability in the lower face region, but methods to differentiate upper face movements due to emotion from those due to speech remain underexplored. This gap motivates the proposed Informed Segmentation and Labeling Approach (ISLA). ISLA uses speech signals, which alter the dynamics of the lower and upper face regions, to temporally segment and label facial movements. We demonstrate how pitch can be used to improve estimates of emotion from the upper face, and how these estimates can be combined with emotion estimates from the lower face and speech in a multimodal classification system. Our emotion classification results on the IEMOCAP and SAVEE datasets show that ISLA improves overall classification performance. We also demonstrate how emotion estimates from different modalities correlate with one another, providing insights into the differences between posed and spontaneous expressions.
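To make the segmentation-and-fusion idea concrete, the sketch below illustrates one plausible reading of the abstract: frames are split by a pitch-based voiced/unvoiced decision, upper-face emotion scores are pooled separately over the two segment types, and the result is late-fused with lower-face and speech estimates. This is a minimal illustration, not the authors' implementation; the voicing threshold, the synthetic per-frame scores, and the equal fusion weights are all assumptions made for the example.

import numpy as np

# Minimal sketch of pitch-informed segmentation and late fusion.
# NOT the ISLA implementation: thresholds, feature shapes, and fusion
# weights below are illustrative assumptions only.

def segment_by_voicing(f0, voiced_threshold=50.0):
    """Label each frame voiced/unvoiced from a per-frame F0 track (Hz).

    Frames with F0 at or below `voiced_threshold` (an assumed value)
    are treated as unvoiced.
    """
    return f0 > voiced_threshold  # boolean mask, True = voiced

def pooled_scores(features, voiced_mask):
    """Average per-frame emotion scores separately over voiced and
    unvoiced segments, so speech-driven facial motion can be treated
    differently from motion outside speech (illustrative only)."""
    n_classes = features.shape[1]
    voiced = (features[voiced_mask].mean(axis=0)
              if voiced_mask.any() else np.zeros(n_classes))
    unvoiced = (features[~voiced_mask].mean(axis=0)
                if (~voiced_mask).any() else np.zeros(n_classes))
    return voiced, unvoiced

# --- toy example with synthetic data (4 emotion classes, 100 frames) ---
rng = np.random.default_rng(0)
n_frames, n_classes = 100, 4
# Synthetic F0 track: ~60% voiced frames (80-250 Hz), rest unvoiced (0 Hz).
f0 = np.where(rng.random(n_frames) < 0.6,
              rng.uniform(80.0, 250.0, n_frames), 0.0)
upper_face = rng.random((n_frames, n_classes))  # stand-in upper-face scores
lower_face = rng.random((n_frames, n_classes))  # stand-in lower-face scores
speech = rng.random((n_frames, n_classes))      # stand-in acoustic scores

mask = segment_by_voicing(f0)
upper_voiced, upper_unvoiced = pooled_scores(upper_face, mask)

# Simple late fusion of modality-level estimates; equal weights are an
# assumption made to keep the example self-contained.
fused = (0.5 * (upper_voiced + upper_unvoiced)
         + lower_face.mean(axis=0)
         + speech.mean(axis=0)) / 3.0
print("Predicted emotion class:", int(fused.argmax()))

In the paper's actual pipeline, per-segment estimates would feed trained classifiers rather than a fixed average; the equal-weight fusion here only stands in for that final classification step.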
Pages: 196-208
Page count: 13
Related Papers
50 records in total
  • [31] Audio-Visual Domain Adaptation Feature Fusion for Speech Emotion Recognition
    Wei, Jie
    Hu, Guanyu
    Yang, Xinyu
    Luu, Anh Tuan
    Dong, Yizhuo
    INTERSPEECH 2022, 2022, : 1988 - 1992
  • [32] Audio-Visual Emotion Recognition Based on Facial Expression and Affective Speech
    Zhang, Shiqing
    Li, Lemin
    Zhao, Zhijin
MULTIMEDIA AND SIGNAL PROCESSING, 2012, 346 : 46+
  • [33] Leveraging Inter-rater Agreement for Audio-Visual Emotion Recognition
    Kim, Yelin
    Provost, Emily Mower
    2015 INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII), 2015, : 553 - 559
  • [34] Metric Learning-Based Multimodal Audio-Visual Emotion Recognition
    Ghaleb, Esam
    Popa, Mirela
    Asteriadis, Stylianos
    IEEE MULTIMEDIA, 2020, 27 (01) : 37 - 48
  • [35] Audio-Visual Emotion Recognition Based on a DBN Model with Constrained Asynchrony
    Chen, Danqi
    Jiang, Dongmei
    Ravyse, Ilse
    Sahli, Hichem
    PROCEEDINGS OF THE FIFTH INTERNATIONAL CONFERENCE ON IMAGE AND GRAPHICS (ICIG 2009), 2009, : 912 - 916
  • [36] Audio-Visual Fusion Network Based on Conformer for Multimodal Emotion Recognition
    Guo, Peini
    Chen, Zhengyan
    Li, Yidi
    Liu, Hong
    ARTIFICIAL INTELLIGENCE, CICAI 2022, PT II, 2022, 13605 : 315 - 326
  • [37] Leveraging recent advances in deep learning for audio-Visual emotion recognition
    Schoneveld, Liam
    Othmani, Alice
    Abdelkawy, Hazem
    PATTERN RECOGNITION LETTERS, 2021, 146 : 1 - 7
  • [38] Learning Better Representations for Audio-Visual Emotion Recognition with Common Information
    Ma, Fei
    Zhang, Wei
    Li, Yang
    Huang, Shao-Lun
    Zhang, Lin
APPLIED SCIENCES-BASEL, 2020, 10 (20) : 1 - 23
  • [39] Audio-visual affect recognition
    Zeng, Zhihong
    Tu, Jilin
    Liu, Ming
    Huang, Thomas S.
    Pianfetti, Brian
    Roth, Dan
    Levinson, Stephen
    IEEE TRANSACTIONS ON MULTIMEDIA, 2007, 9 (02) : 424 - 428
  • [40] Audio-visual integration of emotion expression
    Collignon, Olivier
    Girard, Simon
    Gosselin, Frederic
    Roy, Sylvain
    Saint-Amour, Dave
    Lassonde, Maryse
    Lepore, Franco
    BRAIN RESEARCH, 2008, 1242 : 126 - 135