Investigation of Using Continuous Representation of Various Linguistic Units in Neural Network Based Text-to-Speech Synthesis

Cited by: 3
Authors
Wang, Xin [1,2]
Takaki, Shinji [1]
Yamagishi, Junichi [1,2,3]
Affiliations
[1] Natl Inst Informat, Tokyo 1018430, Japan
[2] SOKENDAI, Tokyo 1018430, Japan
[3] Univ Edinburgh, CSTR, Edinburgh EH8 9LW, Midlothian, Scotland
Source
IEICE Transactions on Information and Systems
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
text-to-speech; speech synthesis; recurrent neural network; contexts; word embedding
DOI
10.1587/transinf.2016SLP0011
CLC number (Chinese Library Classification)
TP [automation technology; computer technology]
Subject classification code
0812
Abstract
Building high-quality text-to-speech (TTS) systems without expert knowledge of the target language and/or time-consuming manual annotation of speech and text data is an important yet challenging research topic. In this kind of TTS system, it is vital to find a representation of the input text that is both effective and easy to acquire. Recently, the continuous representation of raw word inputs, called "word embedding", has been successfully used in various natural language processing tasks. It has also been used as an additional or alternative linguistic input feature for neural-network-based acoustic models in TTS systems. In this paper, we further investigate the use of this embedding technique to represent phonemes, syllables, and phrases for acoustic models based on recurrent and feed-forward neural networks. The experimental results show that most of these continuous representations do not significantly improve the system's performance when they are fed into the acoustic model either as an additional component or as a replacement for the conventional prosodic context. However, subjective evaluation shows that the continuous representation of phrases achieves a significant improvement when it is combined with the prosodic context as input to the feed-forward acoustic model.
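To make the input scheme concrete, the following minimal PyTorch sketch shows one way a learned continuous representation of a linguistic unit (here, a phrase) can be concatenated with the conventional prosodic-context vector and fed to a feed-forward acoustic model. All dimensions, layer sizes, and the phrase inventory size are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: continuous phrase representation combined with the
# conventional prosodic context as input to a feed-forward acoustic model.
# Dimensions and names below are illustrative assumptions.
import torch
import torch.nn as nn

PROSODIC_DIM = 389    # assumed size of the conventional context vector
PHRASE_VOCAB = 10000  # assumed number of distinct phrase units
EMBED_DIM = 64        # assumed size of the continuous phrase representation
ACOUSTIC_DIM = 187    # assumed acoustic feature size (e.g., MGC + F0 + BAP)


class FeedForwardAcousticModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Continuous phrase representation, learned jointly with the model.
        self.phrase_embedding = nn.Embedding(PHRASE_VOCAB, EMBED_DIM)
        self.net = nn.Sequential(
            nn.Linear(PROSODIC_DIM + EMBED_DIM, 512),
            nn.Tanh(),
            nn.Linear(512, 512),
            nn.Tanh(),
            nn.Linear(512, ACOUSTIC_DIM),
        )

    def forward(self, prosodic_context, phrase_id):
        # Look up the phrase embedding and append it to the prosodic context.
        emb = self.phrase_embedding(phrase_id)
        x = torch.cat([prosodic_context, emb], dim=-1)
        return self.net(x)


# Usage: a batch of 8 frames, each tagged with the phrase it belongs to.
model = FeedForwardAcousticModel()
context = torch.randn(8, PROSODIC_DIM)
phrase_ids = torch.randint(0, PHRASE_VOCAB, (8,))
acoustic = model(context, phrase_ids)  # -> shape (8, ACOUSTIC_DIM)
```

Replacing the prosodic context entirely would amount to dropping `prosodic_context` from the concatenation; the abstract reports that the combined input, not the replacement, is what yields the significant improvement.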
Pages: 2471-2480
Page count: 10