Hierarchical Transfer Learning for Text-to-Speech in Indonesian, Javanese, and Sundanese Languages

Cited by: 0
Authors
Azizah, Kurniawati [1]
Adriani, Mirna [1]
Affiliations
[1] Univ Indonesia, Fac Comp Sci, Depok, Indonesia
Keywords
deep learning; hierarchical transfer learning; low-resource problem; Indonesian; Javanese; Sundanese; text-to-speech; ALGORITHMS;
DOI
10.1109/icacsis51025.2020.9263086
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
This research develops end-to-end deep learning-based text-to-speech (TTS) for Indonesian, Javanese, and Sundanese. While end-to-end neural TTS, such as Tacotron-2, has made remarkable progress recently, it still suffers from data scarcity for low-resource languages such as Javanese and Sundanese. Our preliminary study shows that Tacotron-2-based TTS needs a large amount of training data: at least 10 hours are required for the model to synthesize intelligible speech of acceptable quality. To address this low-resource problem, we propose hierarchical transfer learning to train TTS for Javanese and Sundanese, taking advantage of a dissimilar high-resource language (English) and a similar intermediate-resource language (Indonesian). The mean opinion score (MOS) of the synthesized speech reaches 4.27 for Indonesian, 4.08 for Javanese, and 3.92 for Sundanese. The word accuracy (WAcc) on semantically unpredicted sentences (SUS) reaches 98.26% for Indonesian, 95.02% for Javanese, and 95.43% for Sundanese. These subjective evaluations of synthetic speech quality demonstrate that our transfer learning scheme can be applied successfully to TTS models for low-resource target domains. Using less than one hour of training data (38 minutes for Indonesian, 16 minutes for Javanese, and 19 minutes for Sundanese), the TTS models learn quickly and achieve adequate performance.
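The abstract reports word accuracy (WAcc) on listener transcriptions but does not state the formula. A minimal sketch follows, assuming the common definition WAcc = 1 - WER, where WER is the word-level edit distance between a reference sentence and a listener's transcription divided by the reference length. The function name `word_accuracy`, the lowercasing and whitespace tokenization, and the clipping at zero are illustrative assumptions, not details taken from the paper.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word accuracy under the assumed definition 1 - WER (clipped at 0).

    WER = (substitutions + deletions + insertions) / number of reference words,
    computed here with a word-level Levenshtein distance.
    """
    # Assumption: case-insensitive comparison on whitespace-separated tokens.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words; curr is built for i reference words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr

    errors = prev[-1]
    return max(0.0, 1.0 - errors / len(ref))


if __name__ == "__main__":
    # Made-up placeholder sentences, not data from the paper.
    print(word_accuracy("the table walked through the blue truth",
                        "the table walked through a blue truth"))
```

A corpus-level score would usually sum the errors and the reference lengths over all test sentences before dividing, rather than averaging per-sentence values; which of the two the authors used is not stated in the abstract.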
Pages: 421 - 428
Page count: 8