Hierarchical Transfer Learning for Multilingual, Multi-Speaker, and Style Transfer DNN-Based TTS on Low-Resource Languages

Citations: 13
Authors
Azizah, Kurniawati [1 ]
Adriani, Mirna [1 ]
Jatmiko, Wisnu [1 ]
Affiliations
[1] Univ Indonesia, Fac Comp Sci, Depok 16424, Indonesia
Source
IEEE ACCESS | 2020, Vol. 8
Keywords
Data models; Training data; Training; Machine learning; Phonetics; Speech synthesis; Vocoders; Deep neural network; hierarchical transfer learning; low-resource; multi-speaker; multilingual; style transfer; text-to-speech; TEXT-TO-SPEECH; ALGORITHMS;
DOI
10.1109/ACCESS.2020.3027619
Chinese Library Classification
TP [Automation Technology; Computer Technology]
Discipline Code
0812
Abstract
This work applies hierarchical transfer learning to implement deep neural network (DNN)-based multilingual text-to-speech (TTS) for low-resource languages. DNN-based systems typically require a large amount of training data. In recent years, while DNN-based TTS has achieved remarkable results for high-resource languages, it still suffers from data scarcity for low-resource languages. In this article, we propose a multi-stage transfer learning strategy to train our TTS model for low-resource languages. We make use of a high-resource language and a joint multilingual dataset of low-resource languages. A monolingual TTS model pre-trained on the high-resource language is fine-tuned on the low-resource language using the same model architecture. Then, we apply partial network-based transfer learning from the pre-trained monolingual TTS to a multilingual TTS, and finally from the pre-trained multilingual TTS to a multilingual TTS with style transfer. Our experiments on Indonesian, Javanese, and Sundanese show adequate quality of synthesized speech. Our multilingual TTS reaches a mean opinion score (MOS) of 4.35 for Indonesian (ground truth = 4.36), 4.20 for Javanese (ground truth = 4.38), and 4.28 for Sundanese (ground truth = 4.20). For parallel style transfer, our TTS model reaches an F0 frame error (FFE) of 9.08%, 10.13%, and 8.43% for Indonesian, Javanese, and Sundanese, respectively. These results indicate that the proposed strategy can be applied effectively to the low-resource target languages. With a small amount of training data, our models learn step by step from a smaller TTS network to larger networks, produce intelligible speech approaching a real human voice, and successfully transfer speaking style from a reference audio.
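The abstract describes partial network-based transfer: each stage starts from the weights of the previous, smaller model and only the newly added modules are trained from scratch. Below is a minimal PyTorch sketch of that idea, not the authors' code; the model classes, layer sizes, and the `partial_transfer` helper are illustrative assumptions, and the real models in the paper are full TTS networks rather than these toy modules.

```python
# A minimal sketch (not the authors' implementation) of partial
# network-based transfer: parameters whose names and shapes match are
# copied from the pre-trained smaller model into the larger one; new
# modules (here, a hypothetical language embedding) start fresh.
import torch
import torch.nn as nn

class MonolingualTTS(nn.Module):
    """Stage 1: monolingual acoustic model (hypothetical architecture)."""
    def __init__(self, vocab=64, hidden=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, 80)  # 80-bin mel frames

class MultilingualTTS(nn.Module):
    """Stage 2: adds a language embedding; shared layers keep their names."""
    def __init__(self, vocab=64, hidden=256, n_langs=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab, hidden)
        self.lang_embedding = nn.Embedding(n_langs, hidden)  # new module
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, 80)

def partial_transfer(src: nn.Module, dst: nn.Module) -> None:
    """Copy every parameter whose name and shape match; leave the rest."""
    src_state = src.state_dict()
    dst_state = dst.state_dict()
    matched = {k: v for k, v in src_state.items()
               if k in dst_state and dst_state[k].shape == v.shape}
    dst_state.update(matched)
    dst.load_state_dict(dst_state)

mono = MonolingualTTS()        # assume: pre-trained on the high-resource language
multi = MultilingualTTS()      # target: Indonesian + Javanese + Sundanese
partial_transfer(mono, multi)  # stage 2 fine-tuning starts from stage-1 weights
```

The same `partial_transfer` step would then be repeated from the multilingual model to the multilingual model with style transfer, so each stage's network grows while inheriting everything it can from the previous stage.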
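The FFE numbers above combine voicing decision errors and gross pitch errors into a single per-frame rate. The sketch below shows the metric under its common definition (a frame is an error if the voicing decision differs, or if both frames are voiced and F0 deviates by more than 20%); the 20% threshold is the usual convention, assumed here, and the arrays are hypothetical per-frame F0 tracks with 0 marking unvoiced frames.

```python
# A minimal sketch of the F0 frame error (FFE) metric, assuming the
# standard definition: voicing decision error OR >20% pitch deviation
# on frames that are voiced in both the reference and the synthesis.
import numpy as np

def f0_frame_error(f0_ref: np.ndarray, f0_syn: np.ndarray) -> float:
    voiced_ref = f0_ref > 0
    voiced_syn = f0_syn > 0
    voicing_error = voiced_ref != voiced_syn            # voicing mismatch
    both_voiced = voiced_ref & voiced_syn
    pitch_error = both_voiced & (                        # gross pitch error
        np.abs(f0_syn - f0_ref) > 0.2 * f0_ref.clip(min=1e-8))
    return float(np.mean(voicing_error | pitch_error))

# Hypothetical 5-frame example: frame 2 is a pitch error, frame 3 a
# voicing error, so FFE = 2/5 = 40.00%.
ref = np.array([0.0, 110.0, 112.0, 0.0, 220.0])
syn = np.array([0.0, 108.0, 140.0, 95.0, 221.0])
print(f"FFE = {f0_frame_error(ref, syn):.2%}")
```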
Pages: 179798-179812
Page count: 15