Estimation of Ideal Binary Mask for Audio-Visual Monaural Speech Enhancement

Cited by: 1

Authors:

Balasubramanian, S. [1]

Rajavel, R. [1]

Kar, Asutosh [2]

Affiliations:

[1] Anna Univ, SSN Coll Engn, Chennai 603110, Tamil Nadu, India

[2] Dr BR Ambedkar Natl Inst Technol Jalandhar, 44027, Jalandhar, Punjab, India

Source:

CIRCUITS SYSTEMS AND SIGNAL PROCESSING | 2023 / Vol. 42 / No. 09

Keywords:

CNN; Audio-visual model; Speech enhancement; Ideal binary mask; NOISE; INTELLIGIBILITY; PREDICTION;

DOI:

10.1007/s00034-023-02340-3

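Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronic technology, communication technology];

Discipline classification code: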

0808 ; 0809 ;

Abstract:

This paper estimates the Ideal Binary Mask (IBM) from the speech cochleagram and visual cues using an Audio-Visual Convolutional Neural Network (AVCNN), with the aim of improving speech intelligibility and quality. Many earlier speech enhancement techniques depended heavily on audio attributes to reduce the noise present in the speech signal. Several recent studies have shown that using visual data as an auxiliary input alongside audio data is more effective in reducing acoustic noise in speech signals. In the proposed work, a multichannel CNN extracts the dynamics of both visual and audio signal features, which are then integrated to estimate the threshold, using the proposed algorithm, from which the IBM for enhancing the speech signal is obtained. The proposed model is evaluated primarily for speech intelligibility in terms of STOI, ESTOI, and CSII; speech quality is additionally measured in terms of PESQ, SSNR, CSIG, CBAK, and COVL. The evaluation results show that the proposed audio-visual mask estimation model outperforms the audio-only, visual-only, and existing audio-visual mask estimation models. The proposed AVCNN model thus demonstrates its efficiency in merging the dynamics of audio information with visual speech information for speech enhancement.
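The abstract does not state how the mask itself is defined. As a minimal sketch, assuming the conventional oracle definition of an ideal binary mask (a time-frequency unit is kept when its local SNR exceeds a local criterion), the idea can be illustrated as follows. This is not the paper's AVCNN threshold estimation; the function names, the `lc_db` parameter, and the toy energy values are all hypothetical.

```python
import math


def ideal_binary_mask(speech_energy, noise_energy, lc_db=0.0):
    """Return a 0/1 mask with one entry per time-frequency unit.

    speech_energy, noise_energy: lists of rows of non-negative floats,
        e.g. per-channel cochleagram energies (assumed known here, which
        is only possible in an oracle setting).
    lc_db: local criterion in dB (hypothetical parameter name; 0 dB is
        an assumed default, not a value taken from the paper).
    """
    eps = 1e-12  # guards against division by zero and log of zero
    mask = []
    for s_row, n_row in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(s_row, n_row):
            local_snr_db = 10.0 * math.log10((s + eps) / (n + eps))
            row.append(1 if local_snr_db > lc_db else 0)
        mask.append(row)
    return mask


def apply_mask(mixture, mask):
    """Keep mixture units where the mask is 1 and zero out the rest."""
    return [[x * m for x, m in zip(x_row, m_row)]
            for x_row, m_row in zip(mixture, mask)]


# Toy 2x2 example with made-up energies.
speech = [[4.0, 0.1], [0.5, 9.0]]
noise = [[1.0, 1.0], [2.0, 1.0]]
mask = ideal_binary_mask(speech, noise)
enhanced = apply_mask([[5.0, 1.1], [2.5, 10.0]], mask)
```

In a learned system such as the one the abstract describes, the clean-speech and noise energies are not available at test time, so the mask has to be predicted from the noisy audio and visual features instead.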


Pages: 5313-5337

Page count: 25

50 items in total

[1] Estimation of Ideal Binary Mask for Audio-Visual Monaural Speech Enhancement
S. Balasubramanian
R. Rajavel
Asutosh Kar
[J]. Circuits, Systems, and Signal Processing, 2023, 42 : 5313 - 5337
[2] Ideal ratio mask estimation based on cochleagram for audio-visual monaural speech enhancement
Balasubramanian, S.
Rajavel, R.
Kar, Asutosh
[J]. Applied Acoustics, 2023, 211
[3] Lite Audio-Visual Speech Enhancement
Chuang, Shang-Yi
Tsao, Yu
Lo, Chen-Chou
Wang, Hsin-Min
[J]. INTERSPEECH 2020, 2020: 1131 - 1135
[4] Audio-visual enhancement of speech in noise
Girin, L
Schwartz, JL
Feng, G
[J]. JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2001, 109 (06): 3007 - 3020
[5] Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis
Yang, Karren
Markovic, Dejan
Krenn, Steven
Agrawal, Vasu
Richard, Alexander
[J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022: 8217 - 8227
[6] DNN driven Speaker Independent Audio-Visual Mask Estimation for Speech Separation
Gogate, Mandar
Adeel, Ahsan
Marxer, Ricard
Barker, Jon
Hussain, Amir
[J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018: 2723 - 2727
[7] Auditory Mask Estimation by RPCA for Monaural Speech Enhancement
Shi, Wenhua
Zhang, Xiongwei
Zou, Xia
Han, Wei
Min, Gang
[J]. 2017 16TH IEEE/ACIS INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION SCIENCE (ICIS 2017), 2017: 179 - 184
[8] Audio-visual speech recognition based on joint training with audio-visual speech enhancement for robust speech recognition
Hwang, Jung-Wook
Park, Jeongkyun
Park, Rae-Hong
Park, Hyung-Min
[J]. APPLIED ACOUSTICS, 2023, 211
[9] Audio-visual speech enhancement with AVCDCN (audio-visual codebook dependent cepstral normalization)
Deligne, S
Potamianos, G
Neti, C
[J]. SAM2002: IEEE SENSOR ARRAY AND MULTICHANNEL SIGNAL PROCESSING WORKSHOP PROCEEDINGS, 2002: 68 - 71
[10] Improved Lite Audio-Visual Speech Enhancement
Chuang, Shang-Yi
Wang, Hsin-Min
Tsao, Yu
[J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1345 - 1359
