Emotion Recognition in Speech using Cross-Modal Transfer in the Wild

Cited: 176
Authors
Albanie, Samuel [1 ]
Nagrani, Arsha [1 ]
Vedaldi, Andrea [1 ]
Zisserman, Andrew [1 ]
Affiliations
[1] Univ Oxford, Dept Engn Sci, Visual Geometry Grp, Oxford, England
Source
PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18), 2018
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK;
Keywords
Cross-modal transfer; speech emotion recognition; FACE-LIKE STIMULI; FACIAL-EXPRESSION; PERCEPTION; VOICE;
DOI
10.1145/3240508.3240578
Chinese Library Classification
TP301 [Theory and Methods];
Discipline Code
081202;
Abstract
Obtaining large, human labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio. We base our approach on a simple hypothesis: that the emotional content of speech correlates with the facial expression of the speaker. By exploiting this relationship, we show that annotations of expression can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation. We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets. Code, models and data are available(1).
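The core idea in the abstract — cross-modal distillation, where an audio "student" is trained to match the soft emotion predictions of a visual "teacher" without any audio labels — can be sketched as a toy example. This is an illustrative sketch only: the synthetic features, the linear student, the stand-in teacher, and constants such as `N_EMOTIONS` are assumptions for the demo, not the paper's actual architecture or data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_AUDIO, N_EMOTIONS = 256, 40, 8  # toy sizes (assumed, not from the paper)

# Synthetic stand-in for audio features; in the paper these would be
# voice segments paired with face tracks of the same speaker, so the
# two modalities share emotional content.
audio_feats = rng.normal(size=(N, D_AUDIO))

# Toy "teacher": a fixed map standing in for face-derived predictions.
teacher_logits = audio_feats @ rng.normal(size=(D_AUDIO, N_EMOTIONS))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

teacher_probs = softmax(teacher_logits)  # soft labels from the teacher

# Linear student on audio features, trained to match the teacher's
# distribution: cross-entropy against soft labels (distillation loss).
W = np.zeros((D_AUDIO, N_EMOTIONS))
lr = 0.1

def distill_loss(W):
    p = softmax(audio_feats @ W)
    return -np.mean(np.sum(teacher_probs * np.log(p + 1e-12), axis=1))

losses = []
for step in range(200):
    p = softmax(audio_feats @ W)
    grad = audio_feats.T @ (p - teacher_probs) / N  # gradient of the loss
    W -= lr * grad
    losses.append(distill_loss(W))

assert losses[-1] < losses[0]  # the student moves toward the teacher
```

Because the student never sees a ground-truth emotion label, only the teacher's output distribution, the same loop applies when no labelled audio exists at all — which is the setting the paper exploits.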
Pages: 292-301
Page count: 10
Related Papers
50 records in total
  • [41] Multimodal Emotion Recognition using Cross-Modal Attention and 1D Convolutional Neural Networks
    Krishna, D. N.
    Patil, Ankita
    INTERSPEECH 2020, 2020, : 4243 - 4247
  • [42] Effects of Age on Cross-Modal Emotion Perception
    Hunter, Edyta Monika
    Phillips, Louise H.
    MacPherson, Sarah E.
    PSYCHOLOGY AND AGING, 2010, 25 (04) : 779 - 787
  • [43] CroMM-VSR: Cross-Modal Memory Augmented Visual Speech Recognition
    Kim, Minsu
    Hong, Joanna
    Park, Se Jin
    Ro, Yong Man
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 4342 - 4355
  • [44] Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition
    Radhakrishnan, Srijith
    Yang, Chao-Han Huck
    Khan, Sumeer Ahmad
    Kumar, Rohit
    Kiani, Narsis A.
    Gomez-Cabreiro, David
    Tegner, Jesper N.
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 10007 - 10016
  • [45] Cross-Modal Mutual Learning for Audio-Visual Speech Recognition and Manipulation
    Yang, Chih-Chun
    Fan, Wan-Cyuan
    Yang, Cheng-Fu
    Wang, Yu-Chiang Frank
    THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 3036 - 3044
  • [46] Cross-Modal Transfer in Visual and Haptic Face Recognition
    Dopjans, Lisa
    Wallraven, Christian
    Buelthoff, Heinrich H.
    IEEE TRANSACTIONS ON HAPTICS, 2009, 2 (04) : 236 - 240
  • [47] HAPTIC AND CROSS-MODAL RECOGNITION IN CHILDREN
    BUSHNELL, EW
    BULLETIN OF THE PSYCHONOMIC SOCIETY, 1991, 29 (06) : 499 - 499
  • [48] CROSS-MODAL RECOGNITION IN CHIMPANZEES AND MONKEYS
    JARVIS, MJ
    ETTLINGER, G
    NEUROPSYCHOLOGIA, 1977, 15 (4-5) : 499 - 506
  • [49] Cross-Modal Distillation for Speaker Recognition
    Jin, Yufeng
    Hu, Guosheng
    Chen, Haonan
    Miao, Duoqian
    Hu, Liang
    Zhao, Cairong
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 11, 2023, : 12977 - 12985
  • [50] Cross-modal attention and letter recognition
    Wesner, Michael
    Miller, Lisa
    INTERNATIONAL JOURNAL OF PSYCHOLOGY, 2008, 43 (3-4) : 343 - 343