PHONEME DEPENDENT SPEAKER EMBEDDING AND MODEL FACTORIZATION FOR MULTI-SPEAKER SPEECH SYNTHESIS AND ADAPTATION

Cited by: 0
Authors
Fu, Ruibo [1 ,2 ]
Tao, Jianhua [1 ,2 ,3 ]
Wen, Zhengqi [1 ]
Zheng, Yibin [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] CAS Ctr Excellence Brain Sci & Intelligence Techn, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
speech synthesis; speaker adaptation; speaker embedding; phoneme representation; synthesis system; neural networks
DOI: not available
Chinese Library Classification: O42 [Acoustics]
Discipline Codes: 070206; 082403
Abstract
This paper presents an architecture for speaker adaptation in a long short-term memory (LSTM) based Mandarin statistical parametric speech synthesis system. In contrast to conventional methods that rely on a fixed, global utterance-level speaker representation borrowed from the speaker recognition task, the proposed method extracts speaker representations at both the utterance and phoneme levels, capturing pronunciation characteristics that vary from phoneme to phoneme. An attention mechanism dynamically combines the representations from the two levels to train a task-specific, phoneme-dependent speaker embedding. To handle unbalanced databases and avoid over-fitting, the model is factorized into an average model and an adaptation model, which are likewise combined by an attention mechanism. We investigate the performance of speaker representations extracted by different methods. Experimental results confirm the adaptability of the proposed speaker embedding and model factorization structure, and listening tests show that the proposed method outperforms the baselines in both naturalness and speaker similarity.
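To make the two-part design described in the abstract concrete, the sketch below shows (a) an attention-weighted combination of utterance-level and phoneme-level speaker vectors into a phoneme-dependent embedding, and (b) an average/adaptation model pair blended by an attention gate. It is a minimal illustration in PyTorch under assumed shapes and layer choices (class names, dimensions, and the sigmoid-gate form of the attention are all hypothetical), not the authors' published implementation.

```python
# Hypothetical sketch of phoneme-dependent speaker embedding and model
# factorization; layer sizes and attention form are illustrative assumptions.
import torch
import torch.nn as nn


class PhonemeDependentSpeakerEmbedding(nn.Module):
    """Blend an utterance-level and a phoneme-level speaker representation
    with a learned attention weight, giving one embedding per phoneme."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # scores the two levels per phoneme

    def forward(self, utt_emb: torch.Tensor, pho_emb: torch.Tensor) -> torch.Tensor:
        # utt_emb: (batch, dim) global speaker vector for the utterance
        # pho_emb: (batch, n_phonemes, dim) per-phoneme speaker vectors
        utt = utt_emb.unsqueeze(1).expand_as(pho_emb)      # broadcast to each phoneme
        pair = torch.cat([utt, pho_emb], dim=-1)           # (batch, n_phonemes, 2*dim)
        alpha = torch.sigmoid(self.score(pair))            # attention weight in (0, 1)
        return alpha * pho_emb + (1.0 - alpha) * utt       # phoneme-dependent embedding


class FactorizedAcousticModel(nn.Module):
    """Average model shared across speakers plus a separate adaptation model,
    combined by an attention gate to limit over-fitting on unbalanced data."""

    def __init__(self, in_dim: int = 128, out_dim: int = 80):
        super().__init__()
        self.average = nn.LSTM(in_dim, out_dim, batch_first=True)  # speaker-independent path
        self.adapt = nn.LSTM(in_dim, out_dim, batch_first=True)    # speaker-dependent path
        self.gate = nn.Linear(2 * out_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_out, _ = self.average(x)
        ada_out, _ = self.adapt(x)
        g = torch.sigmoid(self.gate(torch.cat([avg_out, ada_out], dim=-1)))
        return g * ada_out + (1.0 - g) * avg_out           # gated combination of the two paths


if __name__ == "__main__":
    emb = PhonemeDependentSpeakerEmbedding(dim=128)
    model = FactorizedAcousticModel(in_dim=128, out_dim=80)
    utt = torch.randn(2, 128)        # dummy utterance-level speaker vectors
    pho = torch.randn(2, 10, 128)    # dummy phoneme-level speaker vectors
    features = model(emb(utt, pho))  # (2, 10, 80) predicted acoustic features
    print(features.shape)
```

In this sketch the gate is computed per frame and per dimension, so the adaptation path can dominate only where the target speaker's data supports it; this is one plausible way to realize the "attention mechanism" mentioned in the abstract.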
Pages: 6930-6934 (5 pages)
Related Papers (50 in total; entries [41]-[50] shown)
  • [41] Zero-Shot Normalization Driven Multi-Speaker Text to Speech Synthesis
    Kumar, Neeraj
    Narang, Ankur
    Lall, Brejesh
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1679 - 1693
  • [42] Improving Multi-Speaker Tacotron with Speaker Gating Mechanisms
    Zhao, Wei
    Xu, Li
    He, Ting
    [J]. PROCEEDINGS OF THE 39TH CHINESE CONTROL CONFERENCE, 2020, : 7498 - 7503
  • [43] A hybrid approach to speaker recognition in multi-speaker environment
    Trivedi, J
    Maitra, A
    Mitra, SK
    [J]. PATTERN RECOGNITION AND MACHINE INTELLIGENCE, PROCEEDINGS, 2005, 3776 : 272 - 275
  • [44] Automatic speaker clustering from multi-speaker utterances
    McLaughlin, J
    Reynolds, D
    Singer, E
    O'Leary, GC
    [J]. ICASSP '99: 1999 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, PROCEEDINGS VOLS I-VI, 1999, : 817 - 820
  • [45] End-to-End Multi-Speaker Speech Recognition using Speaker Embeddings and Transfer Learning
    Denisov, Pavel
    Ngoc Thang Vu
    [J]. INTERSPEECH 2019, 2019, : 4425 - 4429
  • [46] Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora
    Luong, Hieu-Thi
    Wang, Xin
    Yamagishi, Junichi
    Nishizawa, Nobuyuki
    [J]. INTERSPEECH 2019, 2019, : 1303 - 1307
  • [47] Fuzzy Phoneme Classification Using Multi-speaker Vocal Tract Length Normalization
    Lung, Jensen Wong Jing
    Salam, Md Sah Hj
    Rehman, Amjad
    Rahim, Mohd Shafry Mohd
    Saba, Tanzila
    [J]. IETE TECHNICAL REVIEW, 2014, 31 (02) : 128 - 136
  • [48] LCMV BEAMFORMING WITH SUBSPACE PROJECTION FOR MULTI-SPEAKER SPEECH ENHANCEMENT
    Hassani, Amin
    Bertrand, Alexander
    Moonen, Marc
    [J]. 2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 91 - 95
  • [49] End-to-End Multilingual Multi-Speaker Speech Recognition
    Seki, Hiroshi
    Hori, Takaaki
    Watanabe, Shinji
    Le Roux, Jonathan
    Hershey, John R.
    [J]. INTERSPEECH 2019, 2019, : 3755 - 3759
  • [50] END-TO-END MULTI-SPEAKER SPEECH RECOGNITION WITH TRANSFORMER
    Chang, Xuankai
    Zhang, Wangyou
    Qian, Yanmin
    Le Roux, Jonathan
    Watanabe, Shinji
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6134 - 6138