End-to-end text-to-speech synthesis with unaligned multiple language units based on attention

Cited by: 2

Authors:

Aso, Masashi [1]

Takamichi, Shinnosuke [1]

Saruwatari, Hiroshi [1]

Affiliations:

[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan

Source:

INTERSPEECH 2020 | 2020

Keywords:

End-to-end; Text-to-speech; Subword; Progressive training; Transformer

DOI:

10.21437/Interspeech.2020-2347

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification code:

100104; 100213

Abstract:

This paper presents the use of unaligned multiple language units for end-to-end text-to-speech (TTS). End-to-end TTS is a promising technology in that it does not require intermediate representation such as prosodic contexts. However, it causes mispronunciation and unnatural prosody. To alleviate this problem, previous methods have used multiple language units, e.g., phonemes and characters, but required the units to be hard-aligned. In this paper, we propose a multi-input attention structure that simultaneously accepts multiple language units without alignments among them. We consider using not only traditional phonemes and characters but also subwords tokenized in a language-independent manner. We also propose a progressive training strategy to deal with the unaligned multiple language units. The experimental results demonstrated that our model and training strategy improve speech quality.

Pages: 4009-4013

Page count: 5
