An Electroglottograph Auxiliary Neural Network for Target Speaker Extraction

Cited: 3
Authors
Chen, Lijiang [1 ]
Mo, Zhendong [1 ]
Ren, Jie [1 ]
Cui, Chunfeng [1 ]
Zhao, Qi [1 ]
Affiliations
[1] Beihang Univ, Sch Elect & Informat Engn, 37 Xueyuan Rd, Beijing 100191, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2023, Vol. 13, No. 01
Funding
National Natural Science Foundation of China; U.S. National Science Foundation;
Keywords
speech extraction; SpeakerBeam; electroglottograph; pre-processing; SPEECH;
DOI
10.3390/app13010469
Chinese Library Classification
O6 [Chemistry]
Subject Classification Code
0703
Abstract
The extraction of a target speaker from mixtures of different speakers has attracted extensive attention and research. Previous studies have proposed several methods, such as SpeakerBeam, that tackle this speech extraction problem using clean speech from the target speaker as auxiliary information. However, clean speech cannot be obtained immediately in most cases. In this study, we addressed this problem by extracting features from the electroglottographs (EGGs) of target speakers. Electroglottography is a laryngeal function detection technology that measures the impedance and contact condition of the vocal folds. Since the collection method gives EGGs excellent anti-noise performance, they can be obtained even in rather noisy environments. To recover the clean speech of a target speaker from mixtures of different speakers, we utilized deep learning methods with EGG signals as the auxiliary information, so that target speakers could be extracted without any clean speech from them. Based on the characteristics of the EGG signals, we developed an EGG_auxiliary network that trains a speaker extraction model under the assumption that EGG signals carry information about the corresponding speech signals. Additionally, we took the correlations between EGGs and speech signals in silent and unvoiced segments into consideration and developed a new network involving EGG preprocessing. We achieved scale-invariant signal-to-distortion ratio improvements (SISDRi) of 0.89 dB on the Chinese Dual-Mode Emotional Speech Database (CDESD) and 1.41 dB on the EMO-DB dataset. In addition, our methods alleviated the poor performance observed when the target and interfering speakers share the same gender, narrowed the gap between same-gender and different-gender conditions, and reduced the severe loss of precision under low-SNR conditions.
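For reference, the SISDRi figure quoted in the abstract is the standard scale-invariant signal-to-distortion ratio of the estimate minus that of the unprocessed mixture. A minimal NumPy sketch of this standard definition (not code from the paper) follows:

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio in dB."""
    # Zero-mean both signals so the optimal scaling is purely multiplicative.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise))

def si_sdri(estimate: np.ndarray, mixture: np.ndarray,
            reference: np.ndarray) -> float:
    """SISDRi: gain of the processed estimate over the raw mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```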
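The EGG_auxiliary network follows the SpeakerBeam pattern, in which an auxiliary branch embeds the conditioning signal and adapts the main extraction network, here with an EGG-derived embedding in place of enrollment speech. The PyTorch sketch below is a hypothetical rendering of that pattern; the layer sizes, mean-pooled embedding, and multiplicative adaptation point are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class EGGAuxiliaryExtractor(nn.Module):
    """Sketch of a SpeakerBeam-style extractor conditioned on an EGG signal.

    Shapes and layer sizes are illustrative, not taken from the paper.
    """
    def __init__(self, feat_dim: int = 257, hidden: int = 512):
        super().__init__()
        # Auxiliary branch: summarize the EGG features into one speaker vector.
        self.aux = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        # Main branch: BLSTM mask estimator over the mixture spectrogram.
        self.pre = nn.Linear(feat_dim, hidden)
        self.blstm = nn.LSTM(hidden, hidden, batch_first=True,
                             bidirectional=True)
        self.mask = nn.Sequential(nn.Linear(2 * hidden, feat_dim), nn.Sigmoid())

    def forward(self, mix_spec: torch.Tensor,
                egg_spec: torch.Tensor) -> torch.Tensor:
        # mix_spec, egg_spec: (batch, frames, feat_dim) magnitude features.
        emb = self.aux(egg_spec).mean(dim=1)       # (batch, hidden)
        h = self.pre(mix_spec) * emb.unsqueeze(1)  # multiplicative adaptation
        h, _ = self.blstm(h)
        return self.mask(h) * mix_spec             # masked mixture magnitudes
```

The multiplicative adaptation layer is the element-wise product SpeakerBeam uses to steer the main network toward the target speaker; only the source of the embedding (EGG instead of clean enrollment speech) changes here.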
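The EGG preprocessing step exploits the fact that EGGs carry little speech-related information in silent and unvoiced segments, where the vocal folds do not vibrate. One plausible reading, sketched here as a guess rather than the paper's actual procedure, is an energy gate that zeroes the EGG in such segments; the frame length, hop, and threshold are illustrative values.

```python
import numpy as np

def gate_unvoiced(egg: np.ndarray, frame: int = 256, hop: int = 128,
                  rel_thresh: float = 0.05) -> np.ndarray:
    """Hypothetical preprocessing: keep EGG only where the folds vibrate.

    Frames whose short-time energy falls below a fraction of the peak
    frame energy are treated as silent/unvoiced and zeroed out.
    """
    out = np.zeros_like(egg)
    starts = range(0, max(len(egg) - frame, 0) + 1, hop)
    # Short-time energy per frame.
    energies = [float(np.dot(egg[s:s + frame], egg[s:s + frame]))
                for s in starts]
    if not energies:
        return out
    peak = max(energies)
    for s, e in zip(starts, energies):
        if e >= rel_thresh * peak:
            out[s:s + frame] = egg[s:s + frame]
    return out
```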
Pages: 19
Related Papers
50 records total
  • [1] SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures
    Zmolikova, Katerina
    Delcroix, Marc
    Kinoshita, Keisuke
    Ochiai, Tsubasa
    Nakatani, Tomohiro
    Burget, Lukas
    Cernocky, Jan
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2019, 13 (04) : 800 - 814
  • [2] COMPACT NETWORK FOR SPEAKERBEAM TARGET SPEAKER EXTRACTION
    Delcroix, Marc
    Zmolikova, Katerina
    Ochiai, Tsubasa
    Kinoshita, Keisuke
    Araki, Shoko
    Nakatani, Tomohiro
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6965 - 6969
  • [3] LEARNING SPEAKER REPRESENTATION FOR NEURAL NETWORK BASED MULTICHANNEL SPEAKER EXTRACTION
    Zmolikova, Katerina
    Delcroix, Marc
    Kinoshita, Keisuke
    Higuchi, Takuya
    Ogawa, Atsunori
    Nakatani, Tomohiro
    [J]. 2017 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2017, : 8 - 15
  • [4] Neural Speaker Extraction with Speaker-Speech Cross-Attention Network
    Wang, Wupeng
    Xu, Chenglin
    Ge, Meng
    Li, Haizhou
    [J]. INTERSPEECH 2021, 2021, : 3535 - 3539
  • [5] TEnet: target speaker extraction network with accumulated speaker embedding for automatic speech recognition
    Li, Wenjie
    Zhang, Pengyuan
    Yan, Yonghong
    [J]. ELECTRONICS LETTERS, 2019, 55 (14) : 816 - 818
  • [6] A Target Speaker Separation Neural Network with Joint-Training
    Yang, Wenjing
    Wang, Jing
    Li, Hongfeng
    Xu, Na
    Xiang, Fei
    Qian, Kai
    Hu, Shenghua
    [J]. 2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021, : 614 - 618
  • [7] Speaker-aware neural network based beamformer for speaker extraction in speech mixtures
Zmolikova, Katerina
    Delcroix, Marc
    Kinoshita, Keisuke
    Higuchi, Takuya
    Ogawa, Atsunori
    Nakatani, Tomohiro
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 2655 - 2659
  • [8] Characterization Vector Extraction Using Neural Network for Speaker Recognition
    Wang, Wenchao
    Yuan, Qingsheng
    Zhou, Ruohua
    Yan, Yonghong
    [J]. 2016 8TH INTERNATIONAL CONFERENCE ON INTELLIGENT HUMAN-MACHINE SYSTEMS AND CYBERNETICS (IHMSC), VOL. 1, 2016, : 355 - 358
  • [9] IMPROVING RNN TRANSDUCER WITH TARGET SPEAKER EXTRACTION AND NEURAL UNCERTAINTY ESTIMATION
    Shi, Jiatong
    Zhang, Chunlei
    Weng, Chao
    Watanabe, Shinji
    Yu, Meng
    Yu, Dong
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6908 - 6912
  • [10] Auxiliary Interference Speaker Loss for Target-Speaker Speech Recognition
    Kanda, Naoyuki
    Horiguchi, Shota
    Takashima, Ryoichi
    Fujita, Yusuke
    Nagamatsu, Kenji
    Watanabe, Shinji
    [J]. INTERSPEECH 2019, 2019, : 236 - 240