Perceptual-Similarity-Aware Deep Speaker Representation Learning for Multi-Speaker Generative Modeling

Cited by: 7
Authors
Saito, Yuki [1]
Takamichi, Shinnosuke [1]
Saruwatari, Hiroshi [1]
Affiliation
[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Bunkyo Ku, Tokyo 1138656, Japan
Keywords
Training; Speech synthesis; Acoustics; Prediction algorithms; Feature extraction; Adaptation models; Controllability; Deep speaker representation learning; active learning; multi-speaker generative modeling; perceptual speaker similarity; speaker embedding; SPEECH SYNTHESIS;
DOI
10.1109/TASLP.2021.3059114
CLC number
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
We propose novel deep speaker representation learning that considers perceptual similarity among speakers for multi-speaker generative modeling. Following its success in accurate discriminative modeling of speaker individuality, deep speaker representation learning (i.e., speaker representation learning using deep neural networks) has been introduced to multi-speaker generative modeling. However, the conventional discriminative algorithm does not necessarily learn speaker embeddings suitable for such generative modeling, which may result in lower quality and less controllability of synthetic speech. We propose three representation learning algorithms that utilize a perceptual speaker similarity matrix obtained by large-scale perceptual scoring of speaker-pair similarity. The algorithms train a speaker encoder to learn speaker embeddings from three different representations of the matrix: a set of vectors, the Gram matrix, and a graph. Furthermore, we propose an active learning algorithm that iterates the perceptual scoring and the speaker encoder training. To obtain accurate embeddings while reducing the costs of scoring and training, the algorithm selects the unscored speaker pairs to be scored next on the basis of the sequentially trained speaker encoder's similarity predictions. Experimental evaluation results show that 1) the proposed representation learning algorithms learn speaker embeddings strongly correlated with perceptual speaker-pair similarity, 2) these embeddings improve synthetic speech quality in speech autoencoding tasks more than conventional d-vectors learned by discriminative modeling do, 3) the proposed active learning algorithm achieves higher synthetic speech quality while reducing the costs of scoring and training, and 4) among the proposed similarity {vector, matrix, graph} embedding algorithms, the vector-based embedding achieves the best speaker similarity of synthetic speech and the graph-based embedding yields the largest improvement in synthetic speech naturalness.
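To make the Gram-matrix idea in the abstract concrete, the following minimal sketch trains a speaker encoder so that the cosine-similarity Gram matrix of its embeddings approximates a perceptually scored speaker-pair similarity matrix. This is an illustrative reading of the approach, not the authors' implementation: the SpeakerEncoder class, the random stand-in features, the toy similarity matrix S, and the MSE objective are all assumptions made for this sketch.

```python
# Minimal sketch of Gram-matrix-style perceptual-similarity embedding learning.
# NOT the authors' implementation: SpeakerEncoder, the random stand-in features,
# and the toy similarity matrix S are assumptions made for illustration only.
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Toy encoder mapping per-speaker frame features to a unit-norm embedding."""
    def __init__(self, feat_dim: int = 40, emb_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        emb = self.net(x).mean(dim=1)                # average over frames -> (n_speakers, emb_dim)
        return nn.functional.normalize(emb, dim=-1)  # unit norm, so dot products are cosine similarities

n_speakers, n_frames, feat_dim = 8, 100, 40
features = torch.randn(n_speakers, n_frames, feat_dim)  # stand-in acoustic features per speaker
S = torch.rand(n_speakers, n_speakers)                  # stand-in perceptual similarity scores in [0, 1]
S = 0.5 * (S + S.T)                                     # treat scored speaker-pair similarity as symmetric
S.fill_diagonal_(1.0)                                   # a speaker is maximally similar to themself

encoder = SpeakerEncoder(feat_dim)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

for step in range(200):
    emb = encoder(features)                 # (n_speakers, emb_dim)
    gram = emb @ emb.T                      # predicted pairwise cosine similarities
    loss = nn.functional.mse_loss(gram, S)  # pull the Gram matrix toward the perceptual matrix
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Per the abstract, the active learning variant would wrap a training loop of this kind in an outer loop that uses the current encoder's similarity predictions to choose which unscored speaker pairs to send for perceptual scoring next; the exact selection criterion is specified in the paper, not in this sketch.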
Pages: 1033 - 1048
Number of pages: 16
Related Papers
50 records in total
  • [1] Deep Gaussian process based multi-speaker speech synthesis with latent speaker representation
    Mitsui, Kentaro
    Koriyama, Tomoki
    Saruwatari, Hiroshi
    [J]. SPEECH COMMUNICATION, 2021, 132 : 132 - 145
  • [2] Speaker conditioned acoustic modeling for multi-speaker conversational ASR
    Chetupalli, Srikanth Raj
    Ganapathy, Sriram
    [J]. INTERSPEECH 2022, 2022, : 3834 - 3838
  • [3] MULTI-SCENARIO DEEP LEARNING FOR MULTI-SPEAKER SOURCE SEPARATION
    Zegers, Jeroen
    Van Hamme, Hugo
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5379 - 5383
  • [4] INVESTIGATION OF FAST AND EFFICIENT METHODS FOR MULTI-SPEAKER MODELING AND SPEAKER ADAPTATION
    Zheng, Yibin
    Li, Xinhui
    Lu, Li
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6618 - 6622
  • [5] MULTI-SPEAKER MODELING AND SPEAKER ADAPTATION FOR DNN-BASED TTS SYNTHESIS
    Fan, Yuchen
    Qian, Yao
    Soong, Frank K.
    He, Lei
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 4475 - 4479
  • [6] Robust Multi-Speaker Tracking via Dictionary Learning and Identity Modeling
    Barnard, Mark
    Koniusz, Peter
    Wang, Wenwu
    Kittler, Josef
    Naqvi, Syed Mohsen
    Chambers, Jonathon
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2014, 16 (03) : 864 - 880
  • [7] A DEEP REINFORCEMENT LEARNING APPROACH TO AUDIO-BASED NAVIGATION IN A MULTI-SPEAKER ENVIRONMENT
    Giannakopoulos, Petros
    Pikrakis, Aggelos
    Cotronis, Yannis
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 3475 - 3479
  • [8] End-to-End Multi-Speaker Speech Recognition using Speaker Embeddings and Transfer Learning
    Denisov, Pavel
    Ngoc Thang Vu
    [J]. INTERSPEECH 2019, 2019, : 4425 - 4429
  • [9] An objective evaluation of the effects of recording conditions and speaker characteristics in multi-speaker deep neural speech synthesis
    Lorincz, Beata
    Stan, Adriana
    Giurgiu, Mircea
    [J]. KNOWLEDGE-BASED AND INTELLIGENT INFORMATION & ENGINEERING SYSTEMS (KSE 2021), 2021, 192 : 756 - 765
  • [10] Single and Multi-Speaker Cloned Voice Detection: From Perceptual to Learned Features
    Barrington, Sarah
    Barua, Romit
    Koorma, Gautham
    Farid, Hany
    [J]. 2023 IEEE INTERNATIONAL WORKSHOP ON INFORMATION FORENSICS AND SECURITY, WIFS, 2023