Prosodic Boundary Prediction Model for Vietnamese Text-To-Speech

Cited by: 1
Authors
Nguyen Thi Thu Trang [1]
Nguyen Hoang Ky [1]
Rilliard, Albert [2 ]
d'Alessandro, Christophe [3 ]
Affiliations
[1] Hanoi Univ Sci & Technol, Hanoi, Vietnam
[2] Univ Paris Saclay, CNRS, LISN, Gif Sur Yvette, France
[3] Sorbonne Univ, Inst Jean le Rond dAlembert, UMR7190 CNRS, Paris, France
Source
Interspeech 2021
Keywords
Prosody modeling; prosodic boundary; pause prediction; Text-To-Speech; speech synthesis; Vietnamese
DOI
10.21437/Interspeech.2021-125
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
This research builds a prosodic boundary prediction model to improve the naturalness of Vietnamese speech synthesis. The model can be used directly to predict prosodic boundaries in the synthesis phase of statistical parametric or end-to-end speech synthesis systems. Besides conventional features related to Part-Of-Speech (POS), this paper proposes two efficient features for predicting prosodic boundaries, syntactic blocks and syntactic links, based on a thorough analysis of a Vietnamese dataset. Syntactic blocks are syntactic phrases whose sizes are bounded in their constituent syntactic tree. The syntactic link of two adjacent words is calculated from the distance between them in the syntax tree. Experimental results show that, with a decision tree classification algorithm, the two proposed predictors improve the boundary prediction model by about 36.4% (F1 score) over the model with only POS features. The final boundary prediction model, which uses POS, syntactic block, and syntactic link features with the LightGBM algorithm, gives the best F1-score, 87.0% on test data. The proposed model helps TTS systems built with HMM-based, DNN-based, or end-to-end speech synthesis techniques improve by about 0.3 MOS points (i.e. 6 to 10%) compared to the same systems without the proposed model.
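The abstract says only that a syntactic link is based on the distance between two adjacent words in the syntax tree; it does not give the exact formula. As a minimal sketch, assuming the distance is the number of edges on the tree path between the two words, the feature could be computed as below. The nested-tuple tree encoding, the function names `leaf_paths` and `syntactic_links`, and the English toy sentence are all hypothetical illustrations, not taken from the paper.

```python
from typing import List, Tuple, Union

# Assumed encoding: a leaf is a word (str); an inner node is (label, child, ...).
Tree = Union[str, tuple]


def leaf_paths(tree: Tree, prefix: Tuple[int, ...] = ()) -> List[Tuple[str, Tuple[int, ...]]]:
    """Return (word, path) pairs, where path lists child indices from the root."""
    if isinstance(tree, str):
        return [(tree, prefix)]
    pairs = []
    for i, child in enumerate(tree[1:]):  # tree[0] is the phrase label
        pairs.extend(leaf_paths(child, prefix + (i,)))
    return pairs


def syntactic_links(tree: Tree) -> List[Tuple[str, str, int]]:
    """For each pair of adjacent words, count the edges on the tree path between them.

    Assumption: this edge count stands in for the paper's "distance in the syntax tree".
    """
    leaves = leaf_paths(tree)
    links = []
    for (w1, p1), (w2, p2) in zip(leaves, leaves[1:]):
        # Length of the shared path prefix = depth of the lowest common ancestor.
        common = 0
        for a, b in zip(p1, p2):
            if a != b:
                break
            common += 1
        links.append((w1, w2, (len(p1) - common) + (len(p2) - common)))
    return links


if __name__ == "__main__":
    # Hypothetical toy parse, used only to exercise the function.
    toy = ("S", ("NP", "the", "cat"),
                ("VP", "sat", ("PP", "on", ("NP", "the", "mat"))))
    for left, right, dist in syntactic_links(toy):
        print(f"{left} | {right}: {dist}")
```

Under this assumption, a larger value between two adjacent words means they sit in more distant constituents, which is the kind of signal a classifier such as a decision tree or LightGBM could use alongside POS features when deciding whether to place a boundary.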
Pages: 3885-3889
Number of pages: 5