Speech Emotion Recognition via Multi-Level Cross-Modal Distillation

Cited by: 4

Authors:

Li, Ruichen [1]

Zhao, Jinming [1]

Jin, Qin [1]

Affiliations:

[1] Renmin University of China, School of Information, Beijing, People's Republic of China

Source:

INTERSPEECH 2021 | 2021

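Funding:

Beijing Natural Science Foundation; National Natural Science Foundation of China; National Key Research and Development Program of China;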

Keywords:

speech emotion recognition; cross-modal transfer; pretraining;

DOI:

10.21437/Interspeech.2021-785

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology];

Discipline classification code:

100104 ; 100213 ;

Abstract:

Speech emotion recognition faces the problem that most of the existing speech corpora are limited in scale and diversity due to the high annotation cost and label ambiguity. In this work, we explore the task of learning robust speech emotion representations based on large unlabeled speech data. Under a simple assumption that the internal emotional states across different modalities are similar, we propose a method called Multi-level Cross-modal Emotion Distillation (MCED), which trains the speech emotion model without any labeled speech emotion data by transferring emotion knowledge from a pretrained text emotion model. Extensive experiments on two benchmark datasets, IEMOCAP and MELD, show that our proposed MCED can help learn effective speech emotion representations which generalize well on downstream speech emotion recognition tasks.
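
The abstract does not give the training objective, so the following is only a minimal sketch of the general idea behind cross-modal distillation at a single level: a text emotion model (teacher) produces a distribution over emotion classes for a transcript, and the speech model (student) is trained to match it on the paired audio, with no emotion labels involved. The function names, the temperature value, and the use of a KL-divergence objective are assumptions for illustration; this is not the paper's MCED formulation, which operates at multiple levels.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution.

    A temperature above 1 softens the distribution (assumed hyperparameter).
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over emotion classes for one utterance.

    teacher_logits: scores from a hypothetical pretrained text emotion model.
    student_logits: scores from the hypothetical speech emotion model.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical scores over three emotion classes for one paired utterance.
text_teacher = [2.0, 0.5, -1.0]
speech_student = [0.3, 0.2, 0.1]
loss = distillation_loss(text_teacher, speech_student)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, so minimizing it over unlabeled paired speech and transcripts pushes the speech model toward the text model's emotion predictions.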

Pages: 4488-4492

Page count: 5
