Prosodic Boundary Prediction Model for Vietnamese Text-To-Speech

Cited by: 1
Authors
Nguyen Thi Thu Trang [1 ]
Nguyen Hoang Ky [1 ]
Rilliard, Albert [2 ]
d'Alessandro, Christophe [3 ]
Affiliations
[1] Hanoi Univ Sci & Technol, Hanoi, Vietnam
[2] Univ Paris Saclay, CNRS, LISN, Gif Sur Yvette, France
[3] Sorbonne Univ, Inst Jean le Rond d'Alembert, UMR7190 CNRS, Paris, France
Keywords
Prosody modeling; prosodic boundary; pause prediction; Text-To-Speech; speech synthesis; Vietnamese;
DOI
10.21437/Interspeech.2021-125
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline Classification
100104; 100213
Abstract
This research aims to build a prosodic boundary prediction model to improve the naturalness of Vietnamese speech synthesis. The model can be used directly to predict prosodic boundaries in the synthesis phase of statistical parametric or end-to-end speech synthesis systems. Besides conventional features related to Part-Of-Speech (POS), this paper proposes two efficient features for predicting prosodic boundaries, syntactic blocks and syntactic links, based on a thorough analysis of a Vietnamese dataset. Syntactic blocks are syntactic phrases whose sizes are bounded in the constituency syntax tree. The syntactic link between two adjacent words is computed from the distance between them in the syntax tree. Experimental results show that the two proposed predictors improve the quality of a boundary prediction model based on a decision tree classifier, yielding an F1 score about 36.4% higher than that of the model with only POS features. The final boundary prediction model combining POS, syntactic block, and syntactic link features with the LightGBM algorithm gives the best F1 score of 87.0% on the test data. The proposed model helps TTS systems built with HMM-based, DNN-based, or end-to-end speech synthesis techniques improve by about 0.3 MOS points (i.e., 6 to 10%) compared to the same systems without it.
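The two ideas named in the abstract can be illustrated with a minimal sketch (not the authors' code): a "syntactic link" taken as the path length between two adjacent words in a constituency parse tree, and a LightGBM classifier over POS plus syntactic features. The toy parse, the feature layout, the POS id encoding, and the hyperparameters below are assumptions for illustration only; the paper's exact feature definitions and training setup are not given in this record.

# Illustrative sketch, assuming the syntactic link is the tree distance between
# adjacent words (edges up to the lowest common ancestor and back down).
from nltk import Tree                 # pip install nltk
from lightgbm import LGBMClassifier   # pip install lightgbm

def syntactic_link(tree, i, j):
    """Tree distance between the i-th and j-th leaves (words) of a parse tree."""
    leaves = tree.treepositions("leaves")
    pos_i, pos_j = leaves[i], leaves[j]
    common = 0
    for a, b in zip(pos_i, pos_j):
        if a != b:
            break
        common += 1
    return (len(pos_i) - common) + (len(pos_j) - common)

# Toy constituency parse ("toi doc sach" ~ "I read books"); a real system
# would obtain this from a Vietnamese parser.
parse = Tree.fromstring("(S (NP (N toi)) (VP (V doc) (NP (N sach))))")
links = [syntactic_link(parse, k, k + 1) for k in range(len(parse.leaves()) - 1)]
print(links)  # [6, 5] -- a larger distance suggests a stronger prosodic boundary

# Boundary classifier sketch: one row per word juncture, e.g.
# [POS id of left word, POS id of right word, syntactic link]; labels mark
# whether a prosodic boundary follows the left word. Real features and labels
# would come from the annotated Vietnamese corpus.
X = [[1, 2, 6], [2, 3, 5], [3, 1, 2], [1, 3, 6], [2, 2, 2], [3, 2, 5]]
y = [1, 1, 0, 1, 0, 1]
clf = LGBMClassifier(n_estimators=20, min_child_samples=1)
clf.fit(X, y)
print(clf.predict([[1, 2, 6]]))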
Pages: 3885-3889
Number of pages: 5