AUDIO-VISUAL SPEECH ENHANCEMENT AND SEPARATION BY UTILIZING MULTI-MODAL SELF-SUPERVISED EMBEDDINGS

Cited by: 0

Authors:

Chern, I-Chun [1]

Hung, Kuo-Hsuan [2,3]

Chen, Yi-Ting [3]

Hussain, Tassadaq [4]

Gogate, Mandar [4]

Hussain, Amir [4]

Tsao, Yu [3]

Hou, Jen-Cheng [3]

Affiliations:

[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

[2] Natl Taiwan Univ, Taipei, Taiwan

[3] Acad Sinica, Taipei, Taiwan

[4] Edinburgh Napier Univ, Edinburgh, Scotland

Source:

2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW | 2023

Keywords:

Audio-Visual Speech Enhancement; Audio-Visual Speech Separation; AV-HuBERT

DOI:

10.1109/ICASSPW59220.2023.10193049

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

AV-HuBERT, a multi-modal self-supervised learning model, has been shown to be effective for categorical problems such as automatic speech recognition and lip-reading. This suggests that useful audio-visual speech representations can be obtained via utilizing multi-modal self-supervised embeddings. Nevertheless, it is unclear if such representations can be generalized to solve real-world multi-modal AV regression tasks, such as audio-visual speech enhancement (AVSE) and audio-visual speech separation (AVSS). In this study, we leveraged the pre-trained AV-HuBERT model followed by an SE module for AVSE and AVSS. Comparative experimental results demonstrate that our proposed model performs better than the state-of-the-art AVSE and traditional audio-only SE models. In summary, our results confirm the effectiveness of our proposed model for the AVSS task with proper fine-tuning strategies, demonstrating that multi-modal self-supervised embeddings obtained from AV-HuBERT can be generalized to audio-visual regression tasks.

Pages: 5
