A STUDY ON NEURAL-NETWORK-BASED TEXT-TO-SPEECH ADAPTATION TECHNIQUES FOR VIETNAMESE

Authors
Pham Ngoc Phuong [1]
Chung Tran Quang [2]
Quoc Truong Do [2]
Mai Chi Luong [3]
Affiliations
[1] Thai Nguyen Univ, Thai Nguyen, Vietnam
[2] Vietnam Artificial Intelligence Solut, VAIS, Hanoi, Vietnam
[3] Vietnam Acad Sci & Technol, Inst Informat Technol, Hanoi, Vietnam
Keywords
Speaker adaptation; Multi-pass fine-tune; TTS adaptation; Vietnamese TTS corpus
DOI
10.1109/O-COCOSDA202152914.2021.9660445
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification codes
100104 ; 100213 ;
Abstract
A main goal of text-to-speech (TTS) adaptation techniques is to produce a model that generates good-quality audio from a small amount of training data. TTS systems for high-resource languages achieve good quality because large amounts of data are available, whereas training models on small (low-resource) datasets is difficult and often yields low-quality audio. Fine-tuning is one approach to overcoming the data limitation, but it still requires a model pretrained on a large amount of data. This paper makes two contributions: (1) a study of how much data a traditional fine-tuning method needs for Vietnamese, where the training data is changed and training is run for a few more iterations; (2) a new fine-tuning pipeline that borrows a pretrained English model and adapts it to any Vietnamese voice with a very small amount of data while maintaining good synthesized speech quality. Experiments show that with only 4 minutes of data a new voice can be synthesized with a good similarity score, and with 16 minutes of data the model generates audio with a mean opinion score (MOS) of 3.8.
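For readers unfamiliar with the metric, a MOS is conventionally the arithmetic mean of listener ratings on a 1 to 5 scale. The sketch below shows that calculation together with a normal-approximation 95% confidence interval. The function name `mean_opinion_score` and the ratings are hypothetical illustrations, not data or code from the paper, and the paper's own evaluation protocol may differ.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for listener ratings on a 1-5 scale."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in [1, 5]")
    mos = statistics.fmean(ratings)
    # Normal-approximation interval: 1.96 * standard error of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical listener ratings for one synthesized voice (not from the paper).
ratings = [5, 4, 4, 3, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With so few ratings the interval is wide; listening tests usually collect many ratings per system for that reason.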
Pages: 199-205
Page count: 7