End-to-end text-to-speech synthesis with unaligned multiple language units based on attention

被引:2
|
作者
Aso, Masashi [1 ]
Takamichi, Shinnosuke [1 ]
Saruwatari, Hiroshi [1 ]
机构
[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan
来源
关键词
End-to-end; Text-to-speech; Subword; Progressive training; Transformer;
D O I
10.21437/Interspeech.2020-2347
中图分类号
R36 [病理学]; R76 [耳鼻咽喉科学];
学科分类号
100104 ; 100213 ;
摘要
This paper presents the use of unaligned multiple language units for end-to-end text-to-speech (TTS). End-to-end TTS is a promising technology in that it does not require intermediate representation such as prosodic contexts. However, it causes mispronunciation and unnatural prosody. To alleviate this problem, previous methods have used multiple language units, e.g., phonemes and characters, but required the units to be hard-aligned. In this paper, we propose a multi-input attention structure that simultaneously accepts multiple language units without alignments among them. We consider using not only traditional phonemes and characters but also subwords tokenized in a language-independent manner. We also propose a progressive training strategy to deal with the unaligned multiple language units. The experimental results demonstrated that our model and training strategy improve speech quality.
引用
收藏
页码:4009 / 4013
页数:5
相关论文
共 50 条
  • [1] EXPLORING END-TO-END NEURAL TEXT-TO-SPEECH SYNTHESIS FOR ROMANIAN
    Dumitrache, Marius
    Rebedea, Traian
    PROCEEDINGS OF THE 15TH INTERNATIONAL CONFERENCE LINGUISTIC RESOURCES AND TOOLS FOR NATURAL LANGUAGE PROCESSING, 2020, : 93 - 102
  • [2] Myanmar Text-to-Speech Synthesis Using End-to-End Model
    Qin, Qinglai
    Yang, Jian
    Li, Peiying
    2020 4TH INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND INFORMATION RETRIEVAL, NLPIR 2020, 2020, : 6 - 11
  • [3] End-to-End Mongolian Text-to-Speech System
    Li, Jingdong
    Zhang, Hui
    Liu, Rui
    Zhang, Xueliang
    Bao, Feilong
    2018 11TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2018, : 483 - 487
  • [4] Improving transfer of expressivity for end-to-end multispeaker text-to-speech synthesis
    Kulkarni, Ajinkya
    Colotte, Vincent
    Jouvet, Denis
    29TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2021), 2021, : 31 - 35
  • [5] Knowledge-based Linguistic Encoding for End-to-End Mandarin Text-to-Speech Synthesis
    Li, Jingbei
    Wu, Zhiyong
    Li, Runnan
    Zhi, Pengpeng
    Yang, Song
    Meng, Helen
    INTERSPEECH 2019, 2019, : 4494 - 4498
  • [6] End-to-End Thai Text-to-Speech with Linguistic Unit
    Wisetpaitoon, Kontawat
    Singkul, Sattaya
    Sakdejayont, Theerat
    Chalothorn, Tawunrat
    PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024, 2024, : 951 - 959
  • [7] NaturalSpeech: End-to-End Text-to-Speech Synthesis With Human-Level Quality
    Tan, Xu
    Chen, Jiawei
    Liu, Haohe
    Cong, Jian
    Zhang, Chen
    Liu, Yanqing
    Wang, Xi
    Leng, Yichong
    Yi, Yuanhao
    He, Lei
    Zhao, Sheng
    Qin, Tao
    Soong, Frank
    Liu, Tie-Yan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (06) : 4234 - 4245
  • [8] End-to-End Text-To-Speech synthesis for under resourced South African languages
    Nthite, Thapelo
    Tsoeu, Mohohlo
    2020 INTERNATIONAL SAUPEC/ROBMECH/PRASA CONFERENCE, 2020, : 684 - 689
  • [9] Central Kurdish Text-to-Speech Synthesis with Novel End-to-End Transformer Training
    Ahmad, Hawraz A.
    Rashid, Tarik A.
    ALGORITHMS, 2024, 17 (07)
  • [10] EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion
    Miao, Chenfeng
    Zhu, Qingying
    Chen, Minchuan
    Ma, Jun
    Wang, Shaojun
    Xiao, Jing
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 1650 - 1661