Hierarchical Transfer Learning for Text-to-Speech in Indonesian, Javanese, and Sundanese Languages

Cited by: 0

Authors:

Azizah, Kurniawati [1]

Adriani, Mirna [1]

Affiliations:

[1] Univ Indonesia, Fac Comp Sci, Depok, Indonesia

Source:

ICACSIS 2020: 2020 12TH INTERNATIONAL CONFERENCE ON ADVANCED COMPUTER SCIENCE AND INFORMATION SYSTEMS (ICACSIS) | 2020

Keywords:

deep learning; hierarchical transfer learning; low-resource problem; Indonesian; Javanese; Sundanese; text-to-speech; ALGORITHMS;

DOI:

10.1109/icacsis51025.2020.9263086

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

This research develops end-to-end deep learning-based text-to-speech (TTS) for Indonesian, Javanese, and Sundanese. While end-to-end neural TTS, such as Tacotron-2, has made remarkable progress recently, it still suffers from data scarcity for low-resource languages such as Javanese and Sundanese. Our preliminary study shows that Tacotron-2-based TTS needs a large amount of training data: a minimum of 10 hours is required for the model to synthesize intelligible speech of acceptable quality. To solve this low-resource problem, our work proposes a hierarchical transfer learning scheme to train TTS for Javanese and Sundanese, taking advantage of a dissimilar high-resource language domain (English) and a similar intermediate-resource language domain (Indonesian). We report that the mean opinion score (MOS) of the synthesized speech reaches 4.27 for Indonesian, 4.08 for Javanese, and 3.92 for Sundanese. The word accuracy (WAcc) evaluation on semantically unpredictable sentences (SUS) reaches 98.26% for Indonesian, 95.02% for Javanese, and 95.43% for Sundanese. The subjective evaluations of synthetic speech quality demonstrate that our transfer learning scheme is successfully applied to TTS models for low-resource target domains. Using less than one hour of training data (38 minutes for Indonesian, 16 minutes for Javanese, and 19 minutes for Sundanese), the TTS models learn quickly and achieve adequate performance.
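
The abstract reports word accuracy (WAcc) on listener transcriptions but does not state the formula. As a minimal sketch, assuming the common definition WAcc = 1 - (substitutions + deletions + insertions) / (number of reference words), the score for one sentence could be computed as follows. The function name `word_accuracy` and the example sentences are hypothetical and are not taken from the paper.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word accuracy = 1 - word-level edit distance / reference length.

    Assumed definition; the paper's exact scoring procedure may differ.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 1.0 - prev[-1] / len(ref)


# Hypothetical listener transcription with one substituted word out of four.
score = word_accuracy("meja biru makan awan", "meja biru minum awan")
```

Under this definition a score can fall below zero when the transcription contains many inserted words, so reported percentages would normally be averaged over all test sentences and listeners.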

Pages: 421-428

Page count: 8
