Prosody modeling for syllable based text-to-speech synthesis using feedforward neural networks

Cited by: 11
Authors
Reddy, V. Ramu [1]
Rao, K. Sreenivasa [1]
Affiliation
[1] Indian Inst Technol, Sch Informat Technol, Kharagpur 721302, W Bengal, India
Keywords
Prosody; Text-to-speech synthesis; Feed-forward neural networks; Phonological features; Positional and contextual features; Articulatory features; Duration
DOI
10.1016/j.neucom.2015.07.053
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Prosody plays an important role in improving the quality of a text-to-speech (TTS) synthesis system. In this paper, features related to linguistic and production constraints are proposed for modeling the prosodic parameters of syllables, namely duration, intonation and intensity. The linguistic constraints are represented by positional, contextual and phonological features, and the production constraints are represented by articulatory features. Neural network models are explored to capture the implicit duration, F0 and intensity knowledge using the above-mentioned features. The prediction performance of the proposed neural network models is evaluated using objective measures: average prediction error (μ), standard deviation (σ) and linear correlation coefficient (γ_{X,Y}). The prediction accuracy of the proposed neural network models is compared with that of other state-of-the-art prosody models used in TTS systems. The prediction accuracy of the proposed prosody models is also verified by listening tests conducted after integrating the models into the baseline TTS system. © 2015 Elsevier B.V. All rights reserved.
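The abstract names three objective measures but does not give their formulas. The following is a minimal sketch of how such measures are commonly computed, assuming that the average prediction error (μ) is the mean absolute difference between actual and predicted values, that the standard deviation (σ) is taken over those absolute errors, and that the linear correlation coefficient (γ) is the Pearson correlation. The function name `prediction_metrics` and these definitions are illustrative assumptions, not the authors' stated method.

```python
import math


def prediction_metrics(actual, predicted):
    """Return (mu, sigma, gamma) for paired actual/predicted values.

    Assumed definitions (not taken from the paper):
      mu    = mean absolute prediction error
      sigma = population standard deviation of the absolute errors
      gamma = Pearson linear correlation between actual and predicted
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(actual)

    # Absolute error per syllable (e.g. duration in ms, F0 in Hz).
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    mu = sum(errors) / n
    sigma = math.sqrt(sum((e - mu) ** 2 for e in errors) / n)

    # Pearson correlation: covariance divided by the product of std devs.
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted)) / n
    std_a = math.sqrt(sum((a - mean_a) ** 2 for a in actual) / n)
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)
    if std_a == 0 or std_p == 0:
        raise ValueError("correlation is undefined for constant input")
    gamma = cov / (std_a * std_p)

    return mu, sigma, gamma


if __name__ == "__main__":
    # Hypothetical syllable durations in milliseconds, for illustration only.
    actual_ms = [100.0, 200.0, 300.0]
    predicted_ms = [110.0, 190.0, 310.0]
    mu, sigma, gamma = prediction_metrics(actual_ms, predicted_ms)
    print(f"mu={mu:.2f} sigma={sigma:.2f} gamma={gamma:.4f}")
```

Under these assumptions, a lower μ and σ together with a γ closer to 1 would indicate that predicted prosodic values track the measured ones more closely.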
Pages: 1323-1334
Number of pages: 12