Hierarchical Transfer Learning for Multilingual, Multi-Speaker, and Style Transfer DNN-Based TTS on Low-Resource Languages

Cited by: 13
|
Authors
Azizah, Kurniawati [1 ]
Adriani, Mirna [1 ]
Jatmiko, Wisnu [1 ]
Affiliations
[1] Univ Indonesia, Fac Comp Sci, Depok 16424, Indonesia
Source
IEEE ACCESS | 2020 / Vol. 8
Keywords
Data models; Training data; Training; Machine learning; Phonetics; Speech synthesis; Vocoders; Deep neural network; hierarchical transfer learning; low-resource; multi-speaker; multilingual; style transfer; text-to-speech; algorithms
DOI
10.1109/ACCESS.2020.3027619
Chinese Library Classification
TP [automation technology, computer technology]
Discipline Classification Code
0812
Abstract
This work applies hierarchical transfer learning to build deep neural network (DNN)-based multilingual text-to-speech (TTS) for low-resource languages. DNN-based systems typically require a large amount of training data. In recent years, while DNN-based TTS has achieved remarkable results for high-resource languages, it still suffers from data scarcity for low-resource languages. In this article, we propose a multi-stage transfer learning strategy to train our TTS model for low-resource languages. We make use of a high-resource language and a joint multilingual dataset of low-resource languages. A monolingual TTS pre-trained on the high-resource language is fine-tuned on the low-resource language using the same model architecture. Then, we apply partial network-based transfer learning from the pre-trained monolingual TTS to a multilingual TTS, and finally from the pre-trained multilingual TTS to a multilingual TTS with style transfer. Our experiments on the Indonesian, Javanese, and Sundanese languages show adequate quality of the synthesized speech. Our multilingual TTS reaches a mean opinion score (MOS) of 4.35 for Indonesian (ground truth = 4.36), 4.20 for Javanese (ground truth = 4.38), and 4.28 for Sundanese (ground truth = 4.20). For parallel style transfer, our TTS model reaches an F0 frame error (FFE) of 9.08%, 10.13%, and 8.43% for Indonesian, Javanese, and Sundanese, respectively. The results indicate that the proposed strategy can be applied effectively to the low-resource language target domain. With a small amount of training data, our models learn step by step from a smaller TTS network to larger networks, produce intelligible speech approaching the real human voice, and successfully transfer speaking style from a reference audio.
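The abstract's "partial network-based transfer learning" step grows a smaller pre-trained network into a larger one by reusing only the sub-network the two architectures share. Below is a minimal PyTorch sketch of that idea, not the authors' code: the function name `partial_transfer` and the commented `MultilingualTTS` class are illustrative assumptions.

```python
# Minimal sketch of partial network-based transfer (PyTorch).
# Only tensors whose name and shape match the destination model are
# copied; newly added modules (e.g. a language embedding or style
# encoder) keep their random initialization and are learned next.
import torch

def partial_transfer(src_state: dict, dst_model: torch.nn.Module) -> list:
    """Copy matching pre-trained tensors into dst_model; return the
    names of the transferred parameters."""
    dst_state = dst_model.state_dict()
    copied = []
    for name, tensor in src_state.items():
        if name in dst_state and dst_state[name].shape == tensor.shape:
            dst_state[name] = tensor
            copied.append(name)
    dst_model.load_state_dict(dst_state)
    return copied

# Stage 1 -> 2 (high-resource monolingual -> low-resource monolingual)
# uses the same architecture, so it is a plain checkpoint restore.
# Stage 2 -> 3 (monolingual -> multilingual) adds new modules, so only
# the overlapping sub-network is transferred, e.g. (hypothetical names):
# mono_ckpt = torch.load("mono_tts.pt", map_location="cpu")
# multi_tts = MultilingualTTS(n_languages=3)
# transferred = partial_transfer(mono_ckpt["model"], multi_tts)
```

The shape check matters at each stage: layers that grow (for example, an input layer extended with language codes) fail the check and are trained from scratch, which is what lets the model "learn step by step from a smaller TTS network to larger networks."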
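The style-transfer results are reported as F0 frame error (FFE). As commonly defined, FFE is the fraction of frames with either a voicing decision error or a gross pitch error (both tracks voiced but F0 deviating by more than 20%). The NumPy sketch below follows that common definition and assumes per-frame F0 tracks with 0.0 marking unvoiced frames; it is not claimed to be the paper's exact implementation.

```python
# Sketch of the F0 frame error (FFE) metric under the common
# definition: FFE = mean over frames of (voicing decision error OR
# gross pitch error with >20% F0 deviation).
import numpy as np

def f0_frame_error(f0_ref: np.ndarray, f0_syn: np.ndarray,
                   tol: float = 0.20) -> float:
    ref_voiced = f0_ref > 0
    syn_voiced = f0_syn > 0
    # Voicing decision error: one track voiced, the other unvoiced.
    vde = ref_voiced != syn_voiced
    # Gross pitch error: both voiced, but F0 off by more than `tol`.
    both = ref_voiced & syn_voiced
    gpe = both & (np.abs(f0_syn - f0_ref) > tol * f0_ref)
    return float(np.mean(vde | gpe))

# Reading the paper's numbers: an FFE of 9.08% means roughly 9 of
# every 100 frames have a voicing or pitch error vs. the reference.
```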
Pages: 179798-179812
Page count: 15
Related Papers
50 records in total
  • [1] Transfer Learning, Style Control, and Speaker Reconstruction Loss for Zero-Shot Multilingual Multi-Speaker Text-to-Speech on Low-Resource Languages
    Azizah, Kurniawati
    Jatmiko, Wisnu
    [J]. IEEE ACCESS, 2022, 10 : 5895 - 5911
  • [2] Multi-Speaker Modeling and Speaker Adaptation for DNN-Based TTS Synthesis
    Fan, Yuchen
    Qian, Yao
    Soong, Frank K.
    He, Lei
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 4475 - 4479
  • [3] Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS
    Udagawa, Kenta
    Saito, Yuki
    Saruwatari, Hiroshi
    [J]. INTERSPEECH 2022, 2022, : 2968 - 2972
  • [4] Multi-speaker TTS system for low-resource language using cross-lingual transfer learning and data augmentation
    Byambadorj, Zolzaya
    Nishimura, Ryota
    Ayush, Altangerel
    Ohta, Kengo
    Kitaoka, Norihide
    [J]. 2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021, : 849 - 853
  • [5] Transfer Learning for Low-Resource, Multi-Lingual, and Zero-Shot Multi-Speaker Text-to-Speech
    Jeong, Myeonghun
    Kim, Minchan
    Choi, Byoung Jin
    Yoon, Jaesam
    Jang, Won
    Kim, Nam Soo
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 1519 - 1530
  • [6] TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS
    Zhang, Xulong
    Wang, Jianzong
    Cheng, Ning
    Xiao, Jing
    [J]. 2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [7] Transfer Learning for Low-Resource Multilingual Relation Classification
    Nag, Arijit
    Samanta, Bidisha
    Mukherjee, Animesh
    Ganguly, Niloy
    Chakrabarti, Soumen
    [J]. ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2023, 22 (02)
  • [8] Few-shot Controllable Style Transfer for Low-Resource Multilingual Settings
    Krishna, Kalpesh
    Nathani, Deepak
    Garcia, Xavier
    Samanta, Bidisha
    Talukdar, Partha
    [J]. PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 7439 - 7468
  • [9] Multi-Round Transfer Learning for Low-Resource NMT Using Multiple High-Resource Languages
    Maimaiti, Mieradilijiang
    Liu, Yang
    Luan, Huanbo
    Sun, Maosong
    [J]. ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2019, 18 (04)
  • [10] Hierarchical Transfer Learning Architecture for Low-Resource Neural Machine Translation
    Luo, Gongxu
    Yang, Yating
    Yuan, Yang
    Chen, Zhanheng
    Ainiwaer, Aizimaiti
    [J]. IEEE ACCESS, 2019, 7 : 154157 - 154166