Speaker-Independent Audio-Visual Speech Separation Based on Transformer in Multi-Talker Environments

Cited by: 1
Authors
Wang, Jing [1 ]
Luo, Yiyu [1 ]
Yi, Weiming [2 ]
Xie, Xiang [1 ]
Affiliations
[1] Beijing Inst Technol, Sch Informat & Elect, Beijing 100081, Peoples R China
[2] Beijing Inst Technol, Sch Foreign Languages, Key Lab Language Cognit & Computat, Minist Ind & Informat Technol, Beijing 100081, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
audio-visual speech separation; multi-talker transformer; multi-head attention; lip embedding; time-frequency mask; ENHANCEMENT; INTELLIGIBILITY; QUALITY;
DOI
10.1587/transinf.2021EDP7020
CLC number
TP [Automation and Computer Technology]
Discipline code
0812
Abstract
Speech separation is the task of extracting target speech while suppressing background interference. In applications such as video telephony, visual information about the target speaker is available and can be leveraged for multi-speaker speech separation. Most previous multi-speaker separation methods are based on convolutional or recurrent neural networks. Recently, Transformer-based sequence-to-sequence (Seq2Seq) models have achieved state-of-the-art performance in various tasks, such as neural machine translation (NMT) and automatic speech recognition (ASR). The Transformer has shown an advantage in modeling audio-visual temporal context with multi-head attention blocks that explicitly assign attention weights. Moreover, the Transformer contains no recurrent sub-networks and therefore supports parallel sequence computation. In this paper, we propose a novel speaker-independent audio-visual speech separation method based on the Transformer, which can be flexibly applied to mixtures with an unknown number of speakers and unseen speaker identities. The model receives both audio and visual streams, namely the noisy spectrogram and the target speaker's lip embeddings, and predicts a complex time-frequency mask for that speaker. The model consists of three main components: an audio encoder, a visual encoder, and a Transformer-based mask generator. Two encoder structures, ResNet-based and Transformer-based, are investigated and compared. The performance of the proposed method is evaluated in terms of source-separation and speech-quality metrics. Experimental results on the benchmark GRID dataset show the effectiveness of the method on the speaker-independent separation task in multi-talker environments. The model generalizes well to unseen speaker identities and noise types. Although trained only on 2-speaker mixtures, it achieves reasonable performance when tested on both 2-speaker and 3-speaker mixtures, and it still shows an advantage over previous audio-visual speech separation works.
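To make the described pipeline concrete, the sketch below outlines the three-component architecture named in the abstract (audio encoder, visual encoder, Transformer-based mask generator) in PyTorch. All layer sizes, the frame-wise concatenation used for audio-visual fusion, and every module name are illustrative assumptions for this sketch, not the authors' implementation.

```python
# Minimal sketch of an audio-visual Transformer mask estimator, assuming:
# - the noisy spectrogram is given as stacked real/imaginary parts per frame,
# - lip embeddings are already upsampled to the spectrogram frame rate,
# - fusion is a simple frame-wise concatenation of the two encoded streams.
import torch
import torch.nn as nn


class AudioVisualMaskNet(nn.Module):
    def __init__(self, n_freq=257, lip_dim=512, d_model=256,
                 n_heads=8, n_layers=6):
        super().__init__()
        # Audio encoder: per-frame projection of the complex spectrogram.
        self.audio_enc = nn.Linear(2 * n_freq, d_model)
        # Visual encoder: per-frame projection of the lip embeddings.
        self.visual_enc = nn.Linear(lip_dim, d_model)
        # Transformer-based mask generator over the fused audio-visual sequence.
        layer = nn.TransformerEncoderLayer(d_model=2 * d_model, nhead=n_heads,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Output head: real and imaginary parts of a time-frequency mask.
        self.mask_head = nn.Linear(2 * d_model, 2 * n_freq)

    def forward(self, noisy_spec, lip_emb):
        # noisy_spec: (batch, time, 2 * n_freq); lip_emb: (batch, time, lip_dim)
        a = self.audio_enc(noisy_spec)
        v = self.visual_enc(lip_emb)
        fused = torch.cat([a, v], dim=-1)   # frame-wise audio-visual fusion
        h = self.fusion(fused)              # multi-head self-attention layers
        return self.mask_head(h)            # complex T-F mask for the target


# Shape check with random tensors (batch of 2, 100 frames).
net = AudioVisualMaskNet()
mask = net(torch.randn(2, 100, 2 * 257), torch.randn(2, 100, 512))
print(mask.shape)  # torch.Size([2, 100, 514])
```

Because the mask generator is conditioned on one speaker's lip embeddings at a time, such a model can in principle be run once per visible speaker, which is consistent with the abstract's claim of handling an unknown number and identity of speakers.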
Pages: 766-777
Number of pages: 12
Related Papers
50 records in total
  • [1] FACE LANDMARK-BASED SPEAKER-INDEPENDENT AUDIO-VISUAL SPEECH ENHANCEMENT IN MULTI-TALKER ENVIRONMENTS
    Morrone, Giovanni
    Pasa, Luca
    Tikhanoff, Vadim
    Bergamaschi, Sonia
    Fadiga, Luciano
    Badino, Leonardo
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6900 - 6904
  • [2] PERMUTATION INVARIANT TRAINING OF DEEP MODELS FOR SPEAKER-INDEPENDENT MULTI-TALKER SPEECH SEPARATION
    Yu, Dong
    Kolbaek, Morten
    Tan, Zheng-Hua
    Jensen, Jesper
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 241 - 245
  • [3] Permutation invariant training of deep models for speaker-independent multi-talker speech separation
    Takahashi, Kohei
    Shiraishi, Toshihiko
    [J]. MECHANICAL ENGINEERING JOURNAL, 2023,
  • [4] AN EMPIRICAL STUDY OF VISUAL FEATURES FOR DNN BASED AUDIO-VISUAL SPEECH ENHANCEMENT IN MULTI-TALKER ENVIRONMENTS
    Shetu, Shrishti Saha
    Chakrabarty, Soumitro
    Habets, Emanuel A. P.
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 8418 - 8422
  • [5] Audio-Visual Multi-Talker Speech Recognition in A Cocktail Party
    Wu, Yifei
    Li, Chenda
    Yang, Song
    Wu, Zhongqin
    Qian, Yanmin
    [J]. INTERSPEECH 2021, 2021, : 3021 - 3025
  • [6] Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
    Ephrat, Ariel
    Mosseri, Inbar
    Lang, Oran
    Dekel, Tali
    Wilson, Kevin
    Hassidim, Avinatan
    Freeman, William T.
    Rubinstein, Michael
    [J]. ACM TRANSACTIONS ON GRAPHICS, 2018, 37 (04):
  • [7] Multi-Stream Gated and Pyramidal Temporal Convolutional Neural Networks for Audio-Visual Speech Separation in Multi-Talker Environments
    Luo, Yiyu
    Wang, Jing
    Xu, Liang
    Yang, Lidong
    [J]. INTERSPEECH 2021, 2021, : 1104 - 1108
  • [8] An Attention Based Speaker-Independent Audio-Visual Deep Learning Model for Speech Enhancement
    Sun, Zhongbo
    Wang, Yannan
    Cao, Li
    [J]. MULTIMEDIA MODELING (MMM 2020), PT II, 2020, 11962 : 722 - 728
  • [9] Acoustic scene complexity affects motion behavior during speech perception in audio-visual multi-talker virtual environments
    Slomianka, Valeska
    Dau, Torsten
    Ahrens, Axel
    [J]. SCIENTIFIC REPORTS, 2024, 14 (01):
  • [10] Speaker independent audio-visual speech recognition
    Zhang, Y
    Levinson, S
    Huang, T
    [J]. 2000 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, PROCEEDINGS VOLS I-III, 2000, : 1073 - 1076