Emotion Recognition in Speech using Cross-Modal Transfer in the Wild

Cited by: 176
Authors
Albanie, Samuel [1 ]
Nagrani, Arsha [1 ]
Vedaldi, Andrea [1 ]
Zisserman, Andrew [1 ]
Affiliations
[1] Univ Oxford, Dept Engn Sci, Visual Geometry Grp, Oxford, England
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK;
Keywords
Cross-modal transfer; speech emotion recognition; face-like stimuli; facial expression; perception; voice
DOI
10.1145/3240508.3240578
Chinese Library Classification
TP301 [Theory, Methods];
Discipline Code
081202;
Abstract
Obtaining large, human labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio. We base our approach on a simple hypothesis: that the emotional content of speech correlates with the facial expression of the speaker. By exploiting this relationship, we show that annotations of expression can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation. We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets. Code, models and data are available(1).
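The cross-modal distillation described in the abstract trains the audio student against the face teacher's predicted emotion distribution rather than ground-truth labels. This is not the authors' implementation; it is a minimal sketch of the core distillation objective, with hypothetical 4-class emotion logits and a temperature parameter chosen for illustration:

```python
import math

def softmax(logits, temperature=2.0):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the
    teacher's softened distribution. In the cross-modal setting, the
    teacher sees face frames and the student sees only the audio track,
    so the teacher's emotion posterior is the student's only supervision."""
    p = softmax(teacher_logits, temperature)  # teacher's soft labels
    q = softmax(student_logits, temperature)  # student's prediction
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# Hypothetical emotion logits for one synchronised face/voice clip.
teacher = [3.0, 0.5, -1.0, 0.2]   # from the face teacher network
student = [2.0, 1.0, -0.5, 0.0]   # from the audio student network
loss = distillation_loss(teacher, student)
```

Minimising this loss over many unlabelled face-and-voice video clips is what lets the student learn speech emotion embeddings without any labelled audio.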
Pages: 292-301
Page count: 10
Related Papers (50 total)
  • [1] Contextual and Cross-Modal Interaction for Multi-Modal Speech Emotion Recognition
    Yang, Dingkang
    Huang, Shuai
    Liu, Yang
    Zhang, Lihua
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29: 2093-2097
  • [2] Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition
    Zhang, Sheng
    Chen, Min
    Chen, Jincai
    Li, Yuan-Fang
    Wu, Yiling
    Li, Minglei
    Zhu, Chuanbo
    KNOWLEDGE-BASED SYSTEMS, 2021, 229
  • [3] Cross-Modal Dynamic Transfer Learning for Multimodal Emotion Recognition
    Hong, Soyeon
    Kang, Hyeoungguk
    Cho, Hyunsouk
    IEEE ACCESS, 2024, 12: 14324-14333
  • [4] Speech Emotion Recognition With Early Visual Cross-modal Enhancement Using Spiking Neural Networks
    Mansouri-Benssassi, Esma
    Ye, Juan
    2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019
  • [5] Speech Emotion Recognition Using Global-Aware Cross-Modal Feature Fusion Network
    Li, Feng
    Luo, Jiusong
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, ICIC 2023, PT II, 2023, 14087: 211-221
  • [6] Electroglottograph-Based Speech Emotion Recognition via Cross-Modal Distillation
    Chen, Lijiang
    Ren, Jie
    Mao, Xia
    Zhao, Qi
    APPLIED SCIENCES-BASEL, 2022, 12 (9)
  • [7] Speech Emotion Recognition via Multi-Level Cross-Modal Distillation
    Li, Ruichen
    Zhao, Jinming
    Jin, Qin
    INTERSPEECH 2021, 2021: 4488-4492
  • [8] A Multi-Level Circulant Cross-Modal Transformer for Multimodal Speech Emotion Recognition
    Gong, Peizhu
    Liu, Jin
    Wu, Zhongdai
    Han, Bing
    Wang, Y. Ken
    He, Huihua
    CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 74 (2): 4203-4220
  • [9] Cross-modal dynamic convolution for multi-modal emotion recognition
    Wen, Huanglu
    You, Shaodi
    Fu, Ying
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2021, 78
  • [10] Cross-modal individual recognition in wild African lions
    Gilfillan, Geoffrey
    Vitale, Jessica
    McNutt, John Weldon
    McComb, Karen
    BIOLOGY LETTERS, 2016, 12 (08)