Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition

Cited by: 19
Authors
Zhang, Sheng [1 ]
Chen, Min [3 ,4 ]
Chen, Jincai [1 ,2 ,3 ]
Li, Yuan-Fang [6 ]
Wu, Yiling [5 ]
Li, Minglei [5 ]
Zhu, Chuanbo [1 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Wuhan Natl Lab Optoelect, Wuhan 430074, Peoples R China
[2] Minist Educ China, Key Lab Informat Storage Syst, Engn Res Ctr Data Storage Syst & Technol, Wuhan 430074, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch Comp Sci & Technol, Wuhan 430074, Peoples R China
[4] Huazhong Univ Sci & Technol, Embedded & Pervas Comp EPIC Lab, Wuhan 430074, Peoples R China
[5] Huawei Cloud BU, Shenzhen 518129, Peoples R China
[6] Monash Univ, Fac Informat Technol, Dept Data Sci & AI, Clayton, Vic 3800, Australia
Funding
National Natural Science Foundation of China;
Keywords
Semi-supervised learning; Cross-modal knowledge transfer; Speech emotion recognition; SENTIMENT; MODEL;
DOI
10.1016/j.knosys.2021.107340
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Speech emotion recognition is an important task with a wide range of applications. However, its progress is limited by the lack of large, high-quality labeled speech datasets, owing to the high cost of annotation and the inherent ambiguity of emotion labels. The recent emergence of large-scale video data makes it possible to obtain massive amounts of speech data, albeit unlabeled. To exploit such unlabeled data, previous works have explored semi-supervised learning methods on various tasks; however, noisy pseudo-labels remain a challenge for these methods. In this work, to alleviate this issue, we propose a new architecture that incorporates cross-modal knowledge transfer from the visual to the audio modality into a semi-supervised learning method with consistency regularization. We posit that introducing visual emotional knowledge via cross-modal transfer can increase the diversity and accuracy of pseudo-labels and improve the robustness of the model. To combine the knowledge from cross-modal transfer and semi-supervised learning, we design two fusion algorithms: weighted fusion and consistent & random. Our experiments on the CH-SIMS and IEMOCAP datasets show that our method can effectively use additional unlabeled audio-visual data to outperform state-of-the-art results. (C) 2021 Elsevier B.V. All rights reserved.
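The abstract names two pseudo-label fusion strategies and a consistency-regularized semi-supervised branch without detailing them. The sketch below is one plausible PyTorch reading of each component, not the paper's confirmed algorithms: the function names, the fusion weight alpha, the coin-flip tie-breaking in consistent_and_random, and the FixMatch-style confidence threshold are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def weighted_fusion(audio_logits: torch.Tensor,
                    visual_logits: torch.Tensor,
                    alpha: float = 0.5) -> torch.Tensor:
    # Convex combination of the class distributions from the audio
    # (semi-supervised) branch and the visual (cross-modal teacher)
    # branch. `alpha` is a hypothetical fusion weight.
    p_audio = F.softmax(audio_logits, dim=-1)
    p_visual = F.softmax(visual_logits, dim=-1)
    return alpha * p_audio + (1.0 - alpha) * p_visual

def consistent_and_random(audio_logits: torch.Tensor,
                          visual_logits: torch.Tensor) -> torch.Tensor:
    # One plausible reading of "consistent & random": keep a pseudo-label
    # where the two branches agree on the argmax class; where they
    # disagree, pick one branch's prediction uniformly at random.
    a = audio_logits.argmax(dim=-1)
    v = visual_logits.argmax(dim=-1)
    coin = torch.randint(0, 2, a.shape, device=a.device).bool()
    random_pick = torch.where(coin, a, v)
    return torch.where(a == v, a, random_pick)

def consistency_loss(weak_logits: torch.Tensor,
                     strong_logits: torch.Tensor,
                     threshold: float = 0.95) -> torch.Tensor:
    # FixMatch-style consistency regularization (assumed): pseudo-label a
    # weakly augmented view, then train the strongly augmented view to
    # match it, keeping only predictions above a confidence threshold.
    with torch.no_grad():
        probs = F.softmax(weak_logits, dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = (conf >= threshold).float()
    per_sample = F.cross_entropy(strong_logits, pseudo, reduction="none")
    return (per_sample * mask).mean()
```

In this reading, weighted_fusion would soften pseudo-labels when both modalities are informative, while consistent_and_random would inject the label diversity the abstract attributes to the visual teacher; the consistency term then anchors the audio model to its own confident predictions on unlabeled data.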
Pages: 10