Emotion Recognition in Speech using Cross-Modal Transfer in the Wild

Cited by: 176
Authors
Albanie, Samuel [1 ]
Nagrani, Arsha [1 ]
Vedaldi, Andrea [1 ]
Zisserman, Andrew [1 ]
Affiliations
[1] Univ Oxford, Dept Engn Sci, Visual Geometry Grp, Oxford, England
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK;
Keywords
Cross-modal transfer; speech emotion recognition; face-like stimuli; facial expression; perception; voice
DOI
10.1145/3240508.3240578
Chinese Library Classification
TP301 [Theory, Methods];
Discipline Code
081202;
Abstract
Obtaining large, human labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio. We base our approach on a simple hypothesis: that the emotional content of speech correlates with the facial expression of the speaker. By exploiting this relationship, we show that annotations of expression can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation. We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets. Code, models and data are available(1).
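The cross-modal distillation described in the abstract trains the audio student against the face teacher's predicted emotion distribution rather than ground-truth labels. This is not the authors' implementation; it is a minimal sketch of the core distillation objective, with hypothetical 4-class emotion logits and a temperature parameter chosen for illustration:

```python
import math

def softmax(logits, temperature=2.0):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the
    teacher's softened distribution. In the cross-modal setting, the
    teacher sees face frames and the student sees only the audio track,
    so the teacher's emotion posterior is the student's only supervision."""
    p = softmax(teacher_logits, temperature)  # teacher's soft labels
    q = softmax(student_logits, temperature)  # student's prediction
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# Hypothetical emotion logits for one synchronised face/voice clip.
teacher = [3.0, 0.5, -1.0, 0.2]   # from the face teacher network
student = [2.0, 1.0, -0.5, 0.0]   # from the audio student network
loss = distillation_loss(teacher, student)
```

Minimising this loss over many unlabelled face-and-voice video clips is what lets the student learn speech emotion embeddings without any labelled audio.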
Pages: 292-301
Page count: 10
Related Papers (50 total)
  • [1] Contextual and Cross-Modal Interaction for Multi-Modal Speech Emotion Recognition
    Yang, Dingkang
    Huang, Shuai
    Liu, Yang
    Zhang, Lihua
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29: 2093-2097
  • [2] Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition
    Zhang, Sheng
    Chen, Min
    Chen, Jincai
    Li, Yuan-Fang
    Wu, Yiling
    Li, Minglei
    Zhu, Chuanbo
    KNOWLEDGE-BASED SYSTEMS, 2021, 229
  • [3] Cross-Modal Dynamic Transfer Learning for Multimodal Emotion Recognition
    Hong, Soyeon
    Kang, Hyeoungguk
    Cho, Hyunsouk
    IEEE ACCESS, 2024, 12: 14324-14333
  • [4] Speech Emotion Recognition With Early Visual Cross-modal Enhancement Using Spiking Neural Networks
    Mansouri-Benssassi, Esma
    Ye, Juan
    2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019
  • [5] Speech Emotion Recognition Using Global-Aware Cross-Modal Feature Fusion Network
    Li, Feng
    Luo, Jiusong
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, ICIC 2023, PT II, 2023, 14087: 211-221
  • [6] Electroglottograph-Based Speech Emotion Recognition via Cross-Modal Distillation
    Chen, Lijiang
    Ren, Jie
    Mao, Xia
    Zhao, Qi
    APPLIED SCIENCES-BASEL, 2022, 12 (9)
  • [7] Speech Emotion Recognition via Multi-Level Cross-Modal Distillation
    Li, Ruichen
    Zhao, Jinming
    Jin, Qin
    INTERSPEECH 2021, 2021: 4488-4492
  • [8] A Multi-Level Circulant Cross-Modal Transformer for Multimodal Speech Emotion Recognition
    Gong, Peizhu
    Liu, Jin
    Wu, Zhongdai
    Han, Bing
    Wang, Y. Ken
    He, Huihua
    CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 74 (2): 4203-4220
  • [9] Cross-modal dynamic convolution for multi-modal emotion recognition
    Wen, Huanglu
    You, Shaodi
    Fu, Ying
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2021, 78
  • [10] Cross-modal individual recognition in wild African lions
    Gilfillan, Geoffrey
    Vitale, Jessica
    McNutt, John Weldon
    McComb, Karen
    BIOLOGY LETTERS, 2016, 12 (08)