ON PERMUTATION INVARIANT TRAINING FOR SPEECH SOURCE SEPARATION

Cited by: 0
Authors
Liu, Xiaoyu [1]
Pons, Jordi [1]
Affiliations
[1] Dolby Labs, San Francisco, CA 94103 USA
Keywords
Speech source separation; permutation invariant training; waveform-based models; spectrogram-based models; FILTERBANK
DOI
10.1109/ICASSP39728.2021.9413559
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
We study permutation invariant training (PIT), which targets the permutation ambiguity problem in speaker-independent source separation models. We extend two state-of-the-art PIT strategies. First, we look at the two-stage speaker separation and tracking algorithm based on frame-level PIT (tPIT) and clustering, which was originally proposed for the STFT domain, and we adapt it to work with waveforms and over a learned latent space. Further, we propose an efficient clustering loss that scales to waveform models. Second, we extend a recently proposed auxiliary speaker-ID loss with a deep feature loss based on "problem agnostic speech features" to reduce the local permutation errors made by utterance-level PIT (uPIT). Our results show that the proposed extensions help reduce permutation ambiguity. However, we also note that the studied STFT-based models are more effective at reducing permutation errors than waveform-based models, a perspective overlooked in recent studies.
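The difference between the two PIT variants named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `upit_loss` and `tpit_loss` are hypothetical, mean squared error stands in for whatever training objective the paper uses, and the inputs are assumed to be arrays of shape (sources, frames, features). uPIT picks one output-to-speaker assignment for the whole utterance, while tPIT picks the best assignment independently in every frame.

```python
import itertools

import numpy as np


def upit_loss(est, ref):
    """Utterance-level PIT: a single permutation for all frames.

    est, ref: arrays of shape (n_src, n_frames, n_feat).
    Returns (loss, best_permutation). MSE is an assumed stand-in loss.
    """
    n_src = est.shape[0]
    best_loss, best_perm = None, None
    for perm in itertools.permutations(range(n_src)):
        # Reorder the estimated sources and compare against the references.
        loss = float(np.mean((est[list(perm)] - ref) ** 2))
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


def tpit_loss(est, ref):
    """Frame-level PIT: the best permutation is chosen per frame.

    Returns (loss, list_of_per_frame_permutations).
    """
    n_src = est.shape[0]
    perms = list(itertools.permutations(range(n_src)))
    # per_frame[p, t] = MSE of permutation p at frame t.
    per_frame = np.stack(
        [np.mean((est[list(p)] - ref) ** 2, axis=(0, 2)) for p in perms]
    )
    choice = per_frame.argmin(axis=0)
    loss = float(per_frame.min(axis=0).mean())
    return loss, [perms[i] for i in choice]


# Toy example: the estimate equals the reference, except that the two
# speakers are swapped in the last two of four frames.
rng = np.random.default_rng(0)
ref = rng.standard_normal((2, 4, 3))
est = ref.copy()
est[:, 2:] = ref[::-1, 2:]

u_loss, u_perm = upit_loss(est, ref)
t_loss, t_perms = tpit_loss(est, ref)
print("uPIT loss:", u_loss, "permutation:", u_perm)
print("tPIT loss:", t_loss, "per-frame permutations:", t_perms)
```

In this toy case tPIT reaches zero loss because it may switch assignments mid-utterance, which is why the abstract pairs it with a separate tracking (clustering) stage that stitches frames back into consistent speaker streams. uPIT needs no such stage, but a swap inside the utterance shows up as a nonzero loss, the kind of local permutation error the abstract refers to.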
Pages: 6-10
Number of pages: 5
Related papers
50 in total
  • [1] Probabilistic Permutation Invariant Training for Speech Separation
    Yousefi, Midia
    Khorram, Soheil
    Hansen, John H. L.
    [J]. INTERSPEECH 2019, 2019, : 4604 - 4608
  • [2] INTERRUPTED AND CASCADED PERMUTATION INVARIANT TRAINING FOR SPEECH SEPARATION
    Yang, Gene-Ping
    Wu, Szu-Lin
    Mao, Yao-Wen
    Lee, Hung-yi
    Lee, Lin-shan
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6369 - 6373
  • [3] Permutation Invariant Training of Generative Adversarial Network for Monaural Speech Separation
    Chen, Lianwu
    Yu, Meng
    Qian, Yanmin
    Su, Dan
    Yu, Dong
    [J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 302 - 306
  • [4] Multi-talker Speech Separation Based on Permutation Invariant Training and Beamforming
    Yin, Lu
    Wang, Ziteng
    Xia, Risheng
    Li, Junfeng
    Yan, Yonghong
    [J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 851 - 855
  • [5] MixCycle: Unsupervised Speech Separation via Cyclic Mixture Permutation Invariant Training
    Karamatli, Ertug
    Kirbiz, Serap
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 2637 - 2641
  • [6] ATTENTIONPIT: SOFT PERMUTATION INVARIANT TRAINING FOR AUDIO SOURCE SEPARATION WITH ATTENTION MECHANISM
    Kameoka, Hirokazu
    Seki, Shogo
    Li, Li
    Watanabe, Chihiro
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 706 - 710
  • [7] Utterance-level Permutation Invariant Training with Discriminative Learning for Single Channel Speech Separation
    Fan, Cunhang
    Liu, Bin
    Tao, Jianhua
    Wen, Zhengqi
    Yi, Jiangyan
    Bai, Ye
    [J]. 2018 11TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2018, : 26 - 30
  • [8] Single-channel speech separation using soft-minimum permutation invariant training
    Yousefi, Midia
    Hansen, John H. L.
    [J]. SPEECH COMMUNICATION, 2023, 151 : 76 - 85
  • [9] SINGLE CHANNEL SPEECH SEPARATION WITH CONSTRAINED UTTERANCE LEVEL PERMUTATION INVARIANT TRAINING USING GRID LSTM
    Xu, Chenglin
    Rao, Wei
    Xiao, Xiong
    Chng, Eng Siong
    Li, Haizhou
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 6 - 10
  • [10] Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks
    Kolbaek, Morten
    Yu, Dong
    Tan, Zheng-Hua
    Jensen, Jesper
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2017, 25 (10) : 1901 - 1913