Emotion Recognition in Speech using Cross-Modal Transfer in the Wild

Cited: 176
Authors
Albanie, Samuel [1 ]
Nagrani, Arsha [1 ]
Vedaldi, Andrea [1 ]
Zisserman, Andrew [1 ]
Affiliations
[1] Univ Oxford, Dept Engn Sci, Visual Geometry Grp, Oxford, England
Source
PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18), 2018
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK;
Keywords
Cross-modal transfer; speech emotion recognition; FACE-LIKE STIMULI; FACIAL-EXPRESSION; PERCEPTION; VOICE;
DOI
10.1145/3240508.3240578
Chinese Library Classification
TP301 [Theory and Methods];
Discipline Code
081202;
Abstract
Obtaining large, human labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio. We base our approach on a simple hypothesis: that the emotional content of speech correlates with the facial expression of the speaker. By exploiting this relationship, we show that annotations of expression can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation. We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets. Code, models and data are available(1).
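The core idea in the abstract — cross-modal distillation, where an audio "student" is trained to match the soft emotion predictions of a visual "teacher" without any audio labels — can be sketched as a toy example. This is an illustrative sketch only: the synthetic features, the linear student, the stand-in teacher, and constants such as `N_EMOTIONS` are assumptions for the demo, not the paper's actual architecture or data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_AUDIO, N_EMOTIONS = 256, 40, 8  # toy sizes (assumed, not from the paper)

# Synthetic stand-in for audio features; in the paper these would be
# voice segments paired with face tracks of the same speaker, so the
# two modalities share emotional content.
audio_feats = rng.normal(size=(N, D_AUDIO))

# Toy "teacher": a fixed map standing in for face-derived predictions.
teacher_logits = audio_feats @ rng.normal(size=(D_AUDIO, N_EMOTIONS))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

teacher_probs = softmax(teacher_logits)  # soft labels from the teacher

# Linear student on audio features, trained to match the teacher's
# distribution: cross-entropy against soft labels (distillation loss).
W = np.zeros((D_AUDIO, N_EMOTIONS))
lr = 0.1

def distill_loss(W):
    p = softmax(audio_feats @ W)
    return -np.mean(np.sum(teacher_probs * np.log(p + 1e-12), axis=1))

losses = []
for step in range(200):
    p = softmax(audio_feats @ W)
    grad = audio_feats.T @ (p - teacher_probs) / N  # gradient of the loss
    W -= lr * grad
    losses.append(distill_loss(W))

assert losses[-1] < losses[0]  # the student moves toward the teacher
```

Because the student never sees a ground-truth emotion label, only the teacher's output distribution, the same loop applies when no labelled audio exists at all — which is the setting the paper exploits.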
Pages: 292-301
Page count: 10
Related Papers
50 records in total
  • [41] Multimodal Emotion Recognition using Cross-Modal Attention and 1D Convolutional Neural Networks
    Krishna, D. N.
    Patil, Ankita
    INTERSPEECH 2020, 2020, : 4243 - 4247
  • [42] Effects of Age on Cross-Modal Emotion Perception
    Hunter, Edyta Monika
    Phillips, Louise H.
    MacPherson, Sarah E.
    PSYCHOLOGY AND AGING, 2010, 25 (04) : 779 - 787
  • [43] CroMM-VSR: Cross-Modal Memory Augmented Visual Speech Recognition
    Kim, Minsu
    Hong, Joanna
    Park, Se Jin
    Ro, Yong Man
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 4342 - 4355
  • [44] Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition
    Radhakrishnan, Srijith
    Yang, Chao-Han Huck
    Khan, Sumeer Ahmad
    Kumar, Rohit
    Kiani, Narsis A.
    Gomez-Cabreiro, David
    Tegner, Jesper N.
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 10007 - 10016
  • [45] Cross-Modal Mutual Learning for Audio-Visual Speech Recognition and Manipulation
    Yang, Chih-Chun
    Fan, Wan-Cyuan
    Yang, Cheng-Fu
    Wang, Yu-Chiang Frank
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 3036 - 3044
  • [46] Cross-Modal Transfer in Visual and Haptic Face Recognition
    Dopjans, Lisa
    Wallraven, Christian
    Buelthoff, Heinrich H.
    IEEE TRANSACTIONS ON HAPTICS, 2009, 2 (04) : 236 - 240
  • [47] HAPTIC AND CROSS-MODAL RECOGNITION IN CHILDREN
    BUSHNELL, EW
    BULLETIN OF THE PSYCHONOMIC SOCIETY, 1991, 29 (06) : 499 - 499
  • [48] CROSS-MODAL RECOGNITION IN CHIMPANZEES AND MONKEYS
    JARVIS, MJ
    ETTLINGER, G
    NEUROPSYCHOLOGIA, 1977, 15 (4-5) : 499 - 506
  • [49] Cross-Modal Distillation for Speaker Recognition
    Jin, Yufeng
    Hu, Guosheng
    Chen, Haonan
    Miao, Duoqian
    Hu, Liang
    Zhao, Cairong
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 11, 2023, : 12977 - 12985
  • [50] Cross-modal attention and letter recognition
    Wesner, Michael
    Miller, Lisa
    INTERNATIONAL JOURNAL OF PSYCHOLOGY, 2008, 43 (3-4) : 343 - 343