Autoregressive multi-speaker model in Chinese speech synthesis based on variational autoencoder

Authors
Hao, Xiaoyang [1, 2]
Zhang, Pengyuan [1, 2]
Affiliations
[1] Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing, 100190, China
[2] University of Chinese Academy of Sciences, Beijing, 100049, China
Source
Shengxue Xuebao/Acta Acustica | 2022 / Vol. 47 / No. 03
DOI
Not available
Abstract
Speaker adaptation and speaker labels are two common approaches to multi-speaker speech synthesis. A model obtained by speaker adaptation supports only the speaker it was adapted to and is not robust enough. Conventional speaker labels require supervised speaker information and cannot be learned without supervision from the speech itself. To address these problems, an autoregressive multi-speaker framework based on a variational autoencoder is proposed. First, the variational autoencoder learns speaker information without supervision and encodes it into speaker labels. The speaker labels are then fed, together with linguistic features, into an autoregressive acoustic model. In addition, the acoustic model adopts multi-task learning to avoid over-fitting of the fundamental frequency. Preliminary experiments show that the autoregressive network structure reduces cepstral distortion by 1.018 dB and that multi-task learning reduces the root mean square error of the fundamental frequency by 6.861 Hz. In the subsequent comparative experiments, the proposed method achieves Mean Opinion Scores (MOS) of 3.71, 3.55, and 3.15 and Pinyin Error Rates of 6.71%, 7.54%, and 9.87% on three multi-speaker speech synthesis sub-tasks, respectively, showing that it noticeably improves the quality of synthesized speech. © 2022 Acta Acustica.
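The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no architectural details, so the sketch assumes a standard diagonal-Gaussian variational autoencoder posterior (reparameterization trick and KL term) and assumes that the speaker code is simply concatenated with the linguistic features and the previous acoustic frame. All function names, dimensions, and values are hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian posterior."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def autoregressive_input(linguistic, speaker_code, prev_acoustic):
    """Build one frame of acoustic-model input.

    Assumed layout (not stated in the abstract): linguistic features,
    then the speaker code, then the previously generated acoustic frame.
    """
    return list(linguistic) + list(speaker_code) + list(prev_acoustic)


rng = random.Random(0)

# Toy encoder outputs for one utterance (hypothetical values).
mu, log_var = [0.3, -0.1], [-1.0, -2.0]

speaker_code = reparameterize(mu, log_var, rng)   # unsupervised speaker label
kl = kl_to_standard_normal(mu, log_var)           # regularizer in the VAE loss

# Toy linguistic features and a zero "previous frame" for the first step.
frame_input = autoregressive_input([1.0, 0.0, 0.0], speaker_code, [0.0, 0.0])
print(len(frame_input), kl)
```

In a real system the encoder, the acoustic model, and the multi-task fundamental-frequency loss would be neural networks trained jointly or in stages; the sketch only shows how a sampled speaker code could condition each autoregressive step.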
Pages: 405-416