A Two-Stage Beamforming and Diffusion-Based Refiner System for 3D Speech Enhancement

Authors
Chen, Feilong [1]
Lin, Wenmo [1]
Sun, Chengli [1]
Guo, Qiaosheng [2]
Affiliations
[1] Nanchang Hangkong Univ, Sch Informat Engn, Nanchang 330063, Peoples R China
[2] Chaoyang Jushengtai Xinfeng Technol Co Ltd, Ganzhou 341001, Peoples R China
Keywords
Speech enhancement; 3D speech signal; Diffusion model; Beamforming; Multi-channel
DOI
10.1007/s00034-024-02652-y
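Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification codes
0808; 0809
Abstract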
Speech enhancement in 3D reverberant environments is a challenging and significant problem for many downstream applications, such as speech recognition, speaker identification, and audio analysis. Existing deep neural network models have shown efficacy for 3D speech enhancement tasks, but they often introduce distortions or unnatural artifacts in the enhanced speech. In this work, we propose a novel two-stage refiner system that integrates a neural beamforming network and a diffusion model for robust 3D speech enhancement. The neural beamforming network performs spatial filtering to suppress noise and reverberation, while the diffusion model leverages its generative capability to restore missing or distorted speech components from the beamformed output. To the best of our knowledge, this is the first work that applies the diffusion model as a backend refiner to 3D speech enhancement. We investigate the effect of training the diffusion model with either enhanced speech or clean speech, and find that clean speech can better capture the prior knowledge of speech components and improve the speech recovery. We evaluate our proposed system on different datasets and beamformer architectures, and show that it achieves consistent improvements in metrics such as WER and NISQA, indicating that the diffusion model has strong generalization ability and can serve as a backend refinement module for 3D speech enhancement, regardless of the front-end beamforming network. Our work demonstrates the effectiveness of integrating discriminative and generative models for robust 3D speech enhancement, and also opens up a new direction for applying generative diffusion models to 3D speech processing tasks, which can be used as a backend to various beamforming enhancement methods.
Pages: 4369-4389
Page count: 21