A STUDY ON NEURAL-NETWORK-BASED TEXT-TO-SPEECH ADAPTATION TECHNIQUES FOR VIETNAMESE

Cited by: 0

Authors:

Pham Ngoc Phuong ^{[1]}

Chung Tran Quang ^{[2]}

Quoc Truong Do ^{[2]}

Mai Chi Luong ^{[3]}

Affiliations:

[1] Thai Nguyen Univ, Thai Nguyen, Vietnam

[2] Vietnam Artificial Intelligence Solut, VAIS, Hanoi, Vietnam

[3] Vietnam Acad Sci & Technol, Inst Informat Technol, Hanoi, Vietnam

Source:

2021 24TH CONFERENCE OF THE ORIENTAL COCOSDA INTERNATIONAL COMMITTEE FOR THE CO-ORDINATION AND STANDARDISATION OF SPEECH DATABASES AND ASSESSMENT TECHNIQUES (O-COCOSDA) | 2021

Keywords:

Speaker adaptation; Multi-pass fine-tune; TTS adaptation; Vietnamese TTS corpus

DOI:

10.1109/O-COCOSDA202152914.2021.9660445

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology];

Subject classification codes:

100104 ; 100213 ;

Abstract:

One of the main goals of text-to-speech (TTS) adaptation techniques is to produce a model that can generate good-quality audio from a small amount of training data. TTS systems for high-resource languages achieve good quality because large amounts of data are available, whereas training models on small (low-resource) datasets is difficult and often produces low-quality audio. Fine-tuning is one approach to overcoming this data limitation, but it still requires a pretrained model that has learned from a large amount of data in advance. The paper presents two contributions: (1) a study of the amount of data needed for a traditional fine-tuning method for Vietnamese, in which we change the data and run training for a few more iterations; and (2) a new fine-tuning pipeline that borrows a pretrained English model and adapts it to any Vietnamese voice using a very small amount of data while still maintaining good synthesized speech quality. Our experiments show that with only 4 minutes of data we can synthesize a new voice with a good similarity score, and with 16 minutes of data the model can generate audio with a mean opinion score (MOS) of 3.8.


Pages: 199-205

Page count: 7
