DNN driven Speaker Independent Audio-Visual Mask Estimation for Speech Separation

Cited: 13
Authors
Gogate, Mandar [1 ]
Adeel, Ahsan [1 ]
Marxer, Ricard [2 ,3 ]
Barker, Jon [3 ]
Hussain, Amir [1 ]
Affiliations
[1] Univ Stirling, Stirling, Scotland
[2] Aix Marseille Univ, Univ Toulon, CNRS, LIS, Marseille, France
[3] Univ Sheffield, Sheffield, S Yorkshire, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
Speech Separation; Binary Mask Estimation; Deep Neural Network; Speech Enhancement; Noise
DOI
10.21437/Interspeech.2018-2516
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The human auditory cortex excels at selectively suppressing background noise to focus on a target speaker. Selective attention in the brain is known to contextually exploit available audio and visual cues to focus better on the target speaker while filtering out other noise. In this study, we propose a novel deep neural network (DNN) based audio-visual (AV) mask estimation model. The proposed AV mask estimation model contextually integrates the temporal dynamics of both audio and noise-immune visual features for improved mask estimation and speech separation. For optimal AV feature extraction and ideal binary mask (IBM) estimation, a hybrid DNN architecture is exploited that leverages the complementary strengths of a stacked long short-term memory (LSTM) network and a convolutional LSTM network. Comparative simulation results in terms of speech quality and intelligibility demonstrate that the proposed AV mask estimation model significantly outperforms audio-only and visual-only mask estimation approaches in both speaker-dependent and speaker-independent scenarios.
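As a concrete illustration of the architecture the abstract describes, below is a minimal PyTorch sketch of an AV mask estimator: stacked LSTMs encode the audio and visual feature streams, a fusion LSTM integrates their temporal dynamics, and a sigmoid layer outputs a soft estimate of the IBM (which, by its standard definition, labels a time-frequency unit 1 when the local SNR exceeds a threshold and 0 otherwise). All feature dimensions, layer sizes, and the concatenation-based fusion are illustrative assumptions, and plain stacked LSTMs stand in for the paper's convolutional LSTM branch (PyTorch has no built-in ConvLSTM); this is a sketch, not the authors' exact model.

import torch
import torch.nn as nn

class AVMaskEstimator(nn.Module):
    """Hedged sketch: stacked-LSTM audio/visual encoders + LSTM fusion -> soft IBM estimate."""
    def __init__(self, audio_dim=257, visual_dim=100, hidden=256, freq_bins=257):
        super().__init__()
        # Stacked LSTM over per-frame audio features, e.g. log-magnitude STFT (dims assumed).
        self.audio_lstm = nn.LSTM(audio_dim, hidden, num_layers=2, batch_first=True)
        # Stacked LSTM over noise-immune visual (lip) features, assumed already
        # upsampled to the audio frame rate so the two streams are frame-synchronous.
        self.visual_lstm = nn.LSTM(visual_dim, hidden, num_layers=2, batch_first=True)
        # Fusion LSTM contextually integrates the concatenated AV representations.
        self.fusion_lstm = nn.LSTM(2 * hidden, hidden, num_layers=1, batch_first=True)
        # Per-frame sigmoid head: one soft mask value per frequency bin.
        self.head = nn.Linear(hidden, freq_bins)

    def forward(self, audio, visual):
        # audio: (batch, frames, audio_dim); visual: (batch, frames, visual_dim)
        a, _ = self.audio_lstm(audio)
        v, _ = self.visual_lstm(visual)
        fused, _ = self.fusion_lstm(torch.cat([a, v], dim=-1))
        return torch.sigmoid(self.head(fused))  # (batch, frames, freq_bins), values in [0, 1]

# Usage sketch with random stand-ins for 50 frame-synchronous AV feature frames.
model = AVMaskEstimator()
audio = torch.randn(1, 50, 257)
visual = torch.randn(1, 50, 100)
soft_mask = model(audio, visual)          # (1, 50, 257)
binary_mask = (soft_mask > 0.5).float()   # threshold to obtain a binary mask

Training such a model against IBM targets with binary cross-entropy, then applying the estimated mask to the noisy spectrogram before resynthesis, is the usual mask-based separation recipe; the paper's specific training setup may differ.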
Pages: 2723-2727
Page count: 5
Related Papers
50 records in total
  • [1] Speaker independent audio-visual speech recognition
    Zhang, Y
    Levinson, S
    Huang, T
    [J]. 2000 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, PROCEEDINGS VOLS I-III, 2000, : 1073 - 1076
  • [2] Speaker independent audio-visual continuous speech recognition
    Liang, LH
    Liu, XX
    Zhao, YB
    Pi, XB
    Nefian, AV
    [J]. IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOL I AND II, PROCEEDINGS, 2002, : A25 - A28
  • [3] Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
    Ephrat, Ariel
    Mosseri, Inbar
    Lang, Oran
    Dekel, Tali
    Wilson, Kevin
    Hassidim, Avinatan
    Freeman, William T.
    Rubinstein, Michael
    [J]. ACM TRANSACTIONS ON GRAPHICS, 2018, 37 (04):
  • [4] Speaker-Independent Audio-Visual Speech Separation Based on Transformer in Multi-Talker Environments
    Wang, Jing
    Luo, Yiyu
    Yi, Weiming
    Xie, Xiang
    [J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2022, E105D (04) : 766 - 777
  • [5] Estimation of Ideal Binary Mask for Audio-Visual Monaural Speech Enhancement
    Balasubramanian, S.
    Rajavel, R.
    Kar, Asutosh
    [J]. CIRCUITS SYSTEMS AND SIGNAL PROCESSING, 2023, 42 (09) : 5313 - 5337
  • [6] Real-time speaker localization and speech separation by audio-visual integration
    Nakadai, K
    Hidai, K
    Okuno, HG
    Kitano, H
    [J]. 2002 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, VOLS I-IV, PROCEEDINGS, 2002, : 1043 - 1049
  • [7] Multi-Speaker Audio-Visual Corpus RUSAVIC: Russian Audio-Visual Speech in Cars
    Ivanko, Denis
    Ryumin, Dmitry
    Axyonov, Alexandr
    Kashevnik, Alexey
    Karpov, Alexey
    [J]. LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 1555 - 1559
  • [8] Audio-Visual Speech Recognition in the Presence of a Competing Speaker
    Shao, Xu
    Barker, Jon
    [J]. INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006, : 1292 - 1295
  • [9] Audio-Visual Multilevel Fusion for Speech and Speaker Recognition
    Chetty, Girija
    Wagner, Michael
    [J]. INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, 2008, : 379 - 382