End-to-end text-to-speech synthesis with unaligned multiple language units based on attention

Cited by: 2
Authors
Aso, Masashi [1]
Takamichi, Shinnosuke [1]
Saruwatari, Hiroshi [1]
Affiliations
[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan
Keywords
End-to-end; Text-to-speech; Subword; Progressive training; Transformer
DOI
10.21437/Interspeech.2020-2347
Chinese Library Classification number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
This paper presents the use of unaligned multiple language units for end-to-end text-to-speech (TTS). End-to-end TTS is a promising technology in that it does not require intermediate representations such as prosodic contexts. However, it is prone to mispronunciation and unnatural prosody. To alleviate this problem, previous methods have used multiple language units, e.g., phonemes and characters, but have required the units to be hard-aligned. In this paper, we propose a multi-input attention structure that simultaneously accepts multiple language units without alignments among them. We consider not only traditional phonemes and characters but also subwords tokenized in a language-independent manner. We also propose a progressive training strategy to deal with the unaligned multiple language units. The experimental results demonstrate that our model and training strategy improve speech quality.
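The idea of attending over several unaligned unit sequences can be pictured with a minimal sketch. The abstract does not specify the architecture, so everything below is an assumption made for illustration: the function names, the use of scaled dot-product attention, and the choice to average the per-unit context vectors are hypothetical and are not the authors' model. The point is only that each unit sequence (phonemes, characters, subwords) may have a different length, and the decoder attends to each one separately, so no hard alignment between them is needed.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, memory):
    """Scaled dot-product attention of decoder queries over one encoder memory.

    query:  (T_dec, dim) decoder states
    memory: (T_enc, dim) encoder outputs for a single language unit
    """
    dim = query.shape[-1]
    scores = query @ memory.T / np.sqrt(dim)   # (T_dec, T_enc)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ memory, weights           # context: (T_dec, dim)

def multi_input_attention(query, memories):
    """Attend to each unit sequence independently, then average the contexts.

    Averaging is an illustrative assumption; concatenation or a learned
    gate would be equally plausible ways to combine the contexts.
    """
    contexts, weights = [], {}
    for name, memory in memories.items():
        context, w = attend(query, memory)
        contexts.append(context)
        weights[name] = w
    return np.mean(contexts, axis=0), weights

# Toy inputs: three unit sequences of different lengths for one utterance.
rng = np.random.default_rng(0)
dim = 8
query = rng.normal(size=(5, dim))              # 5 decoder steps
memories = {
    "phoneme": rng.normal(size=(12, dim)),     # 12 phonemes
    "character": rng.normal(size=(7, dim)),    # 7 characters
    "subword": rng.normal(size=(4, dim)),      # 4 subwords
}
context, weights = multi_input_attention(query, memories)
```

Here `context` has one vector per decoder step regardless of how long each unit sequence is, and `weights` holds a separate attention map per unit type.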
Pages: 4009-4013
Page count: 5