Estimation of Ideal Binary Mask for Audio-Visual Monaural Speech Enhancement

Cited by: 1
Authors
Balasubramanian, S. [1]
Rajavel, R. [1]
Kar, Asutosh [2]
Affiliations
[1] Anna Univ, SSN Coll Engn, Chennai 603110, Tamil Nadu, India
[2] Dr BR Ambedkar Natl Inst Technol Jalandhar, 44027, Jalandhar, Punjab, India
Keywords
CNN; Audio-visual model; Speech enhancement; Ideal binary mask; NOISE; INTELLIGIBILITY; PREDICTION
DOI
10.1007/s00034-023-02340-3
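Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809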
Abstract
This paper estimates the Ideal Binary Mask (IBM) from the speech cochleagram and visual cues, using an Audio-Visual Convolutional Neural Network (AVCNN), to improve speech intelligibility and quality. Many earlier speech enhancement techniques depended heavily on audio attributes to reduce the noise present in the speech signal. Several recent studies have shown that using visual data as an auxiliary input alongside audio data is more effective in reducing acoustic noise in speech signals. In the proposed work, a multichannel CNN extracts the dynamics of both visual and audio signal features, which are then integrated to estimate the threshold using the proposed algorithm and obtain the IBM for enhancing the speech signal. The proposed model is evaluated primarily for speech intelligibility in terms of STOI, ESTOI, and CSII; speech quality is also measured in terms of PESQ, SSNR, CSIG, CBAK, and COVL. The evaluation results show that the proposed audio-visual mask estimation model outperforms the audio-only, visual-only, and existing audio-visual mask estimation models. The proposed AVCNN model thus demonstrates its efficiency in merging the dynamics of audio information with visual speech information for speech enhancement.
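For readers unfamiliar with the term, the textbook definition of an ideal binary mask can be sketched as follows. This is a minimal illustration, not the paper's AVCNN or its threshold-estimation algorithm: it assumes the clean-speech and noise energies of each time-frequency unit are known (the oracle case), and the function names, the 0 dB local criterion, and the toy energy values are all hypothetical choices made for this example.

```python
import numpy as np


def ideal_binary_mask(speech_energy, noise_energy, lc_db=0.0, eps=1e-12):
    """Oracle IBM: 1 where the local SNR (dB) exceeds lc_db, else 0.

    speech_energy, noise_energy: arrays of per-unit energies with the
    same shape (for example channels x frames of a cochleagram).
    lc_db: local criterion in dB (assumed value, not from the paper).
    """
    speech_energy = np.asarray(speech_energy, dtype=float)
    noise_energy = np.asarray(noise_energy, dtype=float)
    # eps guards against division by zero and log of zero.
    local_snr_db = 10.0 * np.log10((speech_energy + eps) / (noise_energy + eps))
    return (local_snr_db > lc_db).astype(np.uint8)


def apply_mask(noisy_tf, mask):
    """Keep the time-frequency units the mask marks as speech-dominant."""
    return np.asarray(noisy_tf, dtype=float) * mask


# Toy example: 2 frequency channels x 3 time frames (made-up energies).
speech = np.array([[4.0, 0.1, 2.0],
                   [0.5, 9.0, 0.0]])
noise = np.ones_like(speech)

mask = ideal_binary_mask(speech, noise)
enhanced = apply_mask(speech + noise, mask)
print(mask)
```

In a real system the clean speech is unavailable at test time, which is why a model such as the one described in the abstract has to estimate the mask from the noisy audio and the visual stream instead of computing it directly.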
Pages: 5313-5337
Page count: 25
Related papers
50 in total
  • [1] Estimation of Ideal Binary Mask for Audio-Visual Monaural Speech Enhancement
    S. Balasubramanian
    R. Rajavel
    Asutosh Kar
    [J]. Circuits, Systems, and Signal Processing, 2023, 42 : 5313 - 5337
  • [2] Ideal ratio mask estimation based on cochleagram for audio-visual monaural speech enhancement
    Balasubramanian, S.
    Rajavel, R.
    Kar, Asutosh
    [J]. Applied Acoustics, 2023, 211
  • [3] Lite Audio-Visual Speech Enhancement
    Chuang, Shang-Yi
    Tsao, Yu
    Lo, Chen-Chou
    Wang, Hsin-Min
    [J]. INTERSPEECH 2020, 2020, : 1131 - 1135
  • [4] Audio-visual enhancement of speech in noise
    Girin, L
    Schwartz, JL
    Feng, G
    [J]. JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2001, 109 (06): : 3007 - 3020
  • [5] Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis
    Yang, Karren
    Markovic, Dejan
    Krenn, Steven
    Agrawal, Vasu
    Richard, Alexander
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 8217 - 8227
  • [6] DNN driven Speaker Independent Audio-Visual Mask Estimation for Speech Separation
    Gogate, Mandar
    Adeel, Ahsan
    Marxer, Ricard
    Barker, Jon
    Hussain, Amir
    [J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 2723 - 2727
  • [7] Auditory Mask Estimation by RPCA for Monaural Speech Enhancement
    Shi, Wenhua
    Zhang, Xiongwei
    Zou, Xia
    Han, Wei
    Min, Gang
    [J]. 2017 16TH IEEE/ACIS INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION SCIENCE (ICIS 2017), 2017, : 179 - 184
  • [8] Audio-visual speech recognition based on joint training with audio-visual speech enhancement for robust speech recognition
    Hwang, Jung-Wook
    Park, Jeongkyun
    Park, Rae-Hong
    Park, Hyung-Min
    [J]. APPLIED ACOUSTICS, 2023, 211
  • [9] Audio-visual speech enhancement with AVCDCN (audio-visual codebook dependent cepstral normalization)
    Deligne, S
    Potamianos, G
    Neti, C
    [J]. SAM2002: IEEE SENSOR ARRAY AND MULTICHANNEL SIGNAL PROCESSING WORKSHOP PROCEEDINGS, 2002, : 68 - 71
  • [10] Improved Lite Audio-Visual Speech Enhancement
    Chuang, Shang-Yi
    Wang, Hsin-Min
    Tsao, Yu
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1345 - 1359