AUDIO-VISUAL SPEECH ENHANCEMENT AND SEPARATION BY UTILIZING MULTI-MODAL SELF-SUPERVISED EMBEDDINGS

Authors
Chern, I-Chun [1]
Hung, Kuo-Hsuan [2, 3]
Chen, Yi-Ting [3]
Hussain, Tassadaq [4]
Gogate, Mandar [4]
Hussain, Amir [4]
Tsao, Yu [3]
Hou, Jen-Cheng [3]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Natl Taiwan Univ, Taipei, Taiwan
[3] Acad Sinica, Taipei, Taiwan
[4] Edinburgh Napier Univ, Edinburgh, Scotland
Keywords
Audio-Visual Speech Enhancement; Audio-Visual Speech Separation; AV-HuBERT
DOI
10.1109/ICASSPW59220.2023.10193049
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
AV-HuBERT, a multi-modal self-supervised learning model, has been shown to be effective for categorical problems such as automatic speech recognition and lip-reading. This suggests that useful audio-visual speech representations can be obtained by utilizing multi-modal self-supervised embeddings. Nevertheless, it is unclear whether such representations can be generalized to solve real-world multi-modal AV regression tasks, such as audio-visual speech enhancement (AVSE) and audio-visual speech separation (AVSS). In this study, we leveraged the pre-trained AV-HuBERT model followed by an SE module for AVSE and AVSS. Comparative experimental results demonstrate that our proposed model performs better than the state-of-the-art AVSE and traditional audio-only SE models. In summary, our results confirm the effectiveness of our proposed model for the AVSS task with proper fine-tuning strategies, demonstrating that multi-modal self-supervised embeddings obtained from AV-HuBERT can be generalized to audio-visual regression tasks.
Pages: 5