Multi-Cue Guided Semi-Supervised Learning Toward Target Speaker Separation in Real Environments

Cited by: 1
Authors
Xu, Jiaming [1 ]
Cui, Jian [2 ,3 ]
Hao, Yunzhe [2 ,3 ]
Xu, Bo [2 ,3 ,4 ]
Affiliations
[1] Xiaomi Corp, Beijing 100085, Peoples R China
[2] Chinese Acad Sci, Inst Automat, Beijing 100190, Peoples R China
[3] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 101408, Peoples R China
[4] Chinese Acad Sci, Ctr Excellence Brain Sci & Intelligence Technol, Shanghai 200031, Peoples R China
Keywords
Cocktail party problem; target speaker separation; multi-cue guided separation; semi-supervised learning; speech recognition; extraction
DOI
10.1109/TASLP.2023.3323856
Chinese Library Classification
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
To solve the cocktail party problem in real multi-talker environments, this article proposes a multi-cue guided semi-supervised target speaker separation method (MuSS). MuSS integrates three target-speaker-related cues: spatial, visual, and voiceprint. Guided by these cues, the target speaker is separated into a predefined output channel, while the interfering sources are separated into the other output channels under the optimal permutation. Both synthetic and real mixtures are used for semi-supervised training. Specifically, for synthetic mixtures, the separated target source and the separated interfering sources are trained to reconstruct the ground-truth references; for real mixtures, the mixture of two real mixtures is fed into the separation model, and the separated sources are remixed to reconstruct the two real mixtures. In addition, to facilitate finetuning and evaluating the estimated sources on real mixtures, we introduce RealMuSS, a real multi-modal speech separation dataset collected in real-world scenarios that comprises more than one hundred hours of multi-talker mixtures with high-quality pseudo references of the target speakers. Experimental results show that the pseudo references substantially improve finetuning efficiency and enable learning and evaluation of speech estimation on real mixtures, and that various cue-driven separation models achieve large gains in signal-to-noise ratio and speech recognition accuracy under our semi-supervised learning framework.
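The real-mixture branch described above, in which two real mixtures are summed, separated, and then remixed to reconstruct each original mixture, resembles mixture-invariant training (MixIT). Below is a minimal NumPy sketch of that idea; the function name `mom_remix_loss` and the mean-squared-error criterion are illustrative assumptions, not the paper's exact loss, which additionally fixes the target speaker to a predefined output channel under cue guidance.

```python
import itertools
import numpy as np

def mom_remix_loss(est_sources, mix1, mix2):
    """Mixture-of-mixtures remix loss (MixIT-style sketch).

    Each estimated source is assigned to one of the two real mixtures;
    the loss is the reconstruction error of the best binary assignment.
    """
    n = len(est_sources)
    best = np.inf
    # Enumerate every binary assignment of the n sources to the two mixtures.
    for mask in itertools.product([0, 1], repeat=n):
        remix1 = sum(s for s, m in zip(est_sources, mask) if m == 0)
        remix2 = sum(s for s, m in zip(est_sources, mask) if m == 1)
        err = np.mean((remix1 - mix1) ** 2) + np.mean((remix2 - mix2) ** 2)
        best = min(best, err)
    return best
```

With perfect separation the loss reaches zero, since some assignment of the separated sources sums back exactly to each real mixture; no ground-truth isolated sources are needed, which is what makes training on real recordings possible.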
Pages: 151-163 (13 pages)