A STUDY ON NEURAL-NETWORK-BASED TEXT-TO-SPEECH ADAPTATION TECHNIQUES FOR VIETNAMESE

Authors
Pham Ngoc Phuong [1]
Chung Tran Quang [2]
Quoc Truong Do [2]
Mai Chi Luong [3]
Affiliations
[1] Thai Nguyen Univ, Thai Nguyen, Vietnam
[2] Vietnam Artificial Intelligence Solut, VAIS, Hanoi, Vietnam
[3] Vietnam Acad Sci & Technol, Inst Informat Technol, Hanoi, Vietnam
Keywords
Speaker adaptation; Multi-pass fine-tune; TTS adaptation; Vietnamese TTS corpus
DOI
10.1109/O-COCOSDA202152914.2021.9660445
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
One of the main goals of text-to-speech (TTS) adaptation techniques is to produce a model that can generate good-quality audio from a small amount of training data. TTS systems for high-resource languages achieve good quality because large amounts of data are available, but training models on small (low-resource) datasets is difficult and often produces low-quality speech. Fine-tuning is one approach to overcoming this data limitation; however, it still requires a model pretrained on a large amount of data. This paper makes two contributions: (1) a study of the amount of data needed by a traditional fine-tuning method for Vietnamese, in which we change the data and run the training for a few more iterations; and (2) a new fine-tuning pipeline that lets us borrow a pretrained English model and adapt it to any Vietnamese voice with a very small amount of data while still maintaining good synthesized speech quality. Our experiments show that with only 4 minutes of data we can synthesize a new voice with a good similarity score, and with 16 minutes of data the model can generate audio with a mean opinion score (MOS) of 3.8.
Pages: 199-205
Number of pages: 7