Deep Neural Network Driven Binaural Audio Visual Speech Separation

Cited by: 6
Authors
Gogate, Mandar [1]
Dashtipour, Kia [1]
Bell, Peter [2]
Hussain, Amir [1]
Affiliations
[1] Edinburgh Napier Univ, Sch Comp, Edinburgh, Midlothian, Scotland
[2] Univ Edinburgh, Sch Informat, Edinburgh, Midlothian, Scotland
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK;
Keywords
Binaural Speech Separation; Audio-Visual; Deep Learning; Mask Estimation; Enhancement;
DOI
10.1109/ijcnn48605.2020.9207517
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The central auditory pathway exploits the auditory signals from both ears and the visual information from the eyes to segregate speech from multiple competing noise sources and to help resolve phonological ambiguity. In this study, inspired by this unique human ability, we present a deep neural network (DNN) that ingests the binaural sounds received at the two ears, as well as the visual frames, to selectively suppress the competing noise sources individually at both ears. The model exploits the noisy binaural cues and noise-robust visual cues to improve speech intelligibility. Comparative simulation results in terms of objective metrics such as PESQ, STOI, SI-SDR and DBSTOI demonstrate a significant performance improvement of the proposed audio-visual (AV) DNN over the audio-only (A-only) variant of the proposed model. Finally, subjective listening tests with the real noisy AV ASPIRE corpus show the superiority of the proposed AV DNN over state-of-the-art approaches.
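Of the objective metrics named in the abstract, SI-SDR (scale-invariant signal-to-distortion ratio) has a compact closed form. The following is a minimal sketch of the standard definition for a single channel, not the authors' evaluation code. The function name `si_sdr`, the `eps` stabiliser, and the synthetic signals are assumptions made for illustration; for a binaural output the function would presumably be applied to the left and right channels separately.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for one channel (hypothetical helper)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove the mean so a DC offset does not count as distortion.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    # Everything left over after the projection is treated as distortion.
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


# Synthetic stand-ins for one second of clean and noisy audio at 16 kHz.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
noisy = clean + 0.5 * noise        # heavier noise
less_noisy = clean + 0.1 * noise   # lighter noise

score_noisy = si_sdr(noisy, clean)
score_less_noisy = si_sdr(less_noisy, clean)
score_rescaled = si_sdr(3.0 * noisy, clean)  # same signal, different gain
```

Because the target is obtained by projection, rescaling the estimate leaves the score unchanged, which is the property that distinguishes SI-SDR from plain SDR.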
Pages: 7
Related papers
50 in total
  • [1] Binaural Deep Neural Network for Robust Speech Enhancement
    Jiang, Yi
    Liu, Runsheng
    [J]. 2014 IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, COMMUNICATIONS AND COMPUTING (ICSPCC), 2014, : 692 - 695
  • [2] Audio-Visual Deep Clustering for Speech Separation
    Lu, Rui
    Duan, Zhiyao
    Zhang, Changshui
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2019, 27 (11) : 1697 - 1712
  • [3] Binaural reverberant Speech separation based on deep neural networks
    Zhang, Xueliang
    Wang, DeLiang
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 2018 - 2022
  • [4] LIGHT-WEIGHT VISUALVOICE: NEURAL NETWORK QUANTIZATION ON AUDIO VISUAL SPEECH SEPARATION
    Wu, Yifei
    Li, Chenda
    Qian, Yanmin
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW, 2023,
  • [5] Audio-Visual (Multimodal) Speech Recognition System Using Deep Neural Network
    Paulin, Hebsibah
    Milton, R. S.
    JanakiRaman, S.
    Chandraprabha, K.
    [J]. JOURNAL OF TESTING AND EVALUATION, 2019, 47 (06) : 3963 - 3974
  • [6] Combining audio and visual speech recognition using LSTM and deep convolutional neural network
    Shashidhar R.
    Patilkulkarni S.
    Puneeth S.B.
    [J]. International Journal of Information Technology, 2022, 14 (7) : 3425 - 3436
  • [7] REVERBERANT SPEECH SEPARATION BASED ON AUDIO-VISUAL DICTIONARY LEARNING AND BINAURAL CUES
    Liu, Qingju
    Wang, Wenwu
    Jackson, Philip
    Barnard, Mark
    [J]. 2012 IEEE STATISTICAL SIGNAL PROCESSING WORKSHOP (SSP), 2012, : 664 - 667
  • [8] DEEP AUDIO-VISUAL SPEECH SEPARATION WITH ATTENTION MECHANISM
    Li, Chenda
    Qian, Yanmin
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7314 - 7318
  • [9] Binaural Deep Neural Network for Noise Robust Automatic Speech Recognition
    Jiang, Yi
    Zu, Yuan-Yuan
    [J]. INTERNATIONAL CONFERENCE ON CONTROL ENGINEERING AND AUTOMATION (ICCEA 2014), 2014, : 512 - 517
  • [10] A REGRESSION APPROACH TO BINAURAL SPEECH SEGREGATION VIA DEEP NEURAL NETWORK
    Fan, Nana
    Du, Jun
    Dai, Li-Rong
    [J]. 2016 10TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2016,