Audio-Visual Speech Separation and Dereverberation With a Two-Stage Multimodal Network

Cited by: 29
Authors:
Tan, Ke [1]
Xu, Yong [2]
Zhang, Shi-Xiong [2]
Yu, Meng [2]
Yu, Dong [2]
Affiliations:
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[2] Tencent AI Lab, Bellevue, WA 98004 USA
Keywords:
Noise measurement; Visualization; Speech processing; Reverberation; Microphone arrays; Training; Audio-visual; multimodal; speech separation and dereverberation; far-field; two-stage; deep learning; ENHANCEMENT; MASKING;
DOI
10.1109/JSTSP.2020.2987209
Chinese Library Classification (CLC): TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline Classification Codes: 0808; 0809
Abstract:
Background noise, interfering speech and room reverberation frequently distort target speech in real listening environments. In this study, we address joint speech separation and dereverberation, which aims to separate target speech from background noise, interfering speech and room reverberation. To tackle this fundamentally difficult problem, we propose a novel multimodal network that exploits both audio and visual signals. The proposed network architecture adopts a two-stage strategy, where a separation module is employed to attenuate background noise and interfering speech in the first stage, and a dereverberation module to suppress room reverberation in the second stage. The two modules are first trained separately, and then integrated for joint training based on a new multi-objective loss function. Our experimental results show that the proposed multimodal network yields consistently better objective intelligibility and perceptual quality than several one-stage and two-stage baselines. We find that our network achieves a 21.10% improvement in ESTOI and a 0.79 improvement in PESQ over the unprocessed mixtures. Moreover, our network architecture does not require knowledge of the number of speakers.
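The abstract outlines a two-stage design: a separation module followed by a dereverberation module, first trained separately and then jointly with a multi-objective loss. The following is a minimal PyTorch sketch of that idea, assuming magnitude-spectrogram inputs, per-frame visual embeddings, simple GRU-based masking modules and an equally weighted MSE loss; the module choices, feature dimensions and loss weights are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class TwoStageAVNet(nn.Module):
    """Sketch of a two-stage audio-visual pipeline (assumed layer choices)."""
    def __init__(self, audio_dim=257, visual_dim=512, hidden=256):
        super().__init__()
        # Stage 1: attenuate background noise and interfering speech,
        # conditioned on visual features of the target speaker.
        self.separation = nn.GRU(audio_dim + visual_dim, hidden, batch_first=True)
        self.sep_mask = nn.Linear(hidden, audio_dim)
        # Stage 2: suppress room reverberation in the separated signal.
        self.dereverb = nn.GRU(audio_dim, hidden, batch_first=True)
        self.drv_mask = nn.Linear(hidden, audio_dim)

    def forward(self, noisy_mag, visual_feat):
        # noisy_mag: (batch, frames, audio_dim) magnitude spectrogram
        # visual_feat: (batch, frames, visual_dim) lip/face embeddings
        x = torch.cat([noisy_mag, visual_feat], dim=-1)
        h1, _ = self.separation(x)
        separated = torch.sigmoid(self.sep_mask(h1)) * noisy_mag      # stage-1 output
        h2, _ = self.dereverb(separated)
        dereverbed = torch.sigmoid(self.drv_mask(h2)) * separated     # stage-2 output
        return separated, dereverbed

def multi_objective_loss(separated, dereverbed, reverb_clean, anechoic_clean, alpha=0.5):
    # Weighted sum of stage-wise losses; the equal weighting (alpha=0.5) is an assumption.
    mse = nn.functional.mse_loss
    return alpha * mse(separated, reverb_clean) + (1 - alpha) * mse(dereverbed, anechoic_clean)

# Example usage with dummy tensors (shapes are assumptions):
# model = TwoStageAVNet()
# sep, drv = model(torch.rand(2, 100, 257), torch.rand(2, 100, 512))

In this sketch, each stage is trained against its own target (reverberant clean speech for stage 1, anechoic clean speech for stage 2) before joint fine-tuning on the combined loss, mirroring the separate-then-joint training strategy described in the abstract.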
Pages: 542-553
Number of pages: 12