Investigation of Using Continuous Representation of Various Linguistic Units in Neural Network Based Text-to-Speech Synthesis

Cited: 3
Authors
Wang, Xin [1 ,2 ]
Takaki, Shinji [1 ]
Yamagishi, Junichi [1 ,2 ,3 ]
Affiliations
[1] Natl Inst Informat, Tokyo 1018430, Japan
[2] SOKENDAI, Tokyo 1018430, Japan
[3] Univ Edinburgh, CSTR, Edinburgh EH8 9LW, Midlothian, Scotland
Source
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
text-to-speech; speech synthesis; recurrent neural network; contexts; word embedding;
DOI
10.1587/transinf.2016SLP0011
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Building high-quality text-to-speech (TTS) systems without expert knowledge of the target language and/or time-consuming manual annotation of speech and text data is an important yet challenging research topic. In this kind of TTS system, it is vital to find a representation of the input text that is both effective and easy to acquire. Recently, the continuous representation of raw word inputs, called "word embedding", has been used successfully in various natural language processing tasks. It has also been used as an additional or alternative linguistic input feature to neural-network-based acoustic models for TTS systems. In this paper, we further investigate the use of this embedding technique to represent phonemes, syllables and phrases for acoustic models based on recurrent and feed-forward neural networks. Results of the experiments show that most of these continuous representations cannot significantly improve the system's performance when they are fed into the acoustic model either as an additional component or as a replacement for the conventional prosodic context. However, subjective evaluation shows that the continuous representation of phrases achieves a significant improvement when it is combined with the prosodic context as input to an acoustic model based on a feed-forward neural network.
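As a rough illustration of the input configuration the abstract describes (the "additional component" setting, where a learned continuous representation is appended to the conventional prosodic context before entering the acoustic model), the sketch below shows the feature concatenation with a toy embedding table. All sizes and the single feed-forward layer are hypothetical stand-ins, not the authors' actual system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (assumptions, not from the paper):
CONTEXT_DIM = 8    # conventional prosodic/linguistic context vector
EMBED_DIM = 4      # continuous (embedding) representation of a phrase
ACOUSTIC_DIM = 5   # acoustic feature vector predicted per frame

# Toy lookup table: one continuous vector per phrase identity,
# standing in for embeddings learned from text data.
phrase_embeddings = rng.normal(size=(100, EMBED_DIM))

def acoustic_model_input(context_vec, phrase_id):
    """Append the phrase embedding to the conventional context
    (the 'additional component' configuration)."""
    return np.concatenate([context_vec, phrase_embeddings[phrase_id]])

# A single linear layer standing in for the feed-forward acoustic model.
W = rng.normal(size=(CONTEXT_DIM + EMBED_DIM, ACOUSTIC_DIM))

x = acoustic_model_input(rng.normal(size=CONTEXT_DIM), phrase_id=7)
y = x @ W
print(y.shape)  # one acoustic feature vector of size ACOUSTIC_DIM
```

In the "replacement" setting the abstract also evaluates, the embedding would be fed instead of, rather than alongside, the prosodic context vector.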
Pages: 2471-2480
Page count: 10
Related Papers
50 records in total
  • [21] ZERO-SHOT TEXT-TO-SPEECH SYNTHESIS CONDITIONED USING SELF-SUPERVISED SPEECH REPRESENTATION MODEL
    Fujita, Kenichi
    Ashihara, Takanori
    Kanagawa, Hiroki
    Moriya, Takafumi
    Ijima, Yusuke
    2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW, 2023,
  • [22] Comparative Study of Text-to-Speech Synthesis Techniques for Mobile Linguistic Translation Process
    Chomwihoke, Phanchita
    Phankokkruad, Manop
    2014 IEEE INTERNATIONAL CONFERENCE ON CONTROL SYSTEM COMPUTING AND ENGINEERING, 2014, : 449 - 454
  • [23] Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation
    Tu, Tao
    Chen, Yuan-Jui
    Liu, Alexander H.
    Lee, Hung-yi
    INTERSPEECH 2020, 2020, : 3191 - 3195
  • [24] Intensity Modeling for Syllable Based Text-to-Speech Synthesis
    Reddy, V. Ramu
    Rao, K. Sreenivasa
    CONTEMPORARY COMPUTING, 2012, 306 : 106 - 117
  • [25] Residual-based speech modification algorithms for text-to-speech synthesis
    Edgington, M
    Lowry, A
    ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, 1996, : 1425 - 1428
  • [26] Two-stage intonation modeling using feedforward neural networks for syllable based text-to-speech synthesis
    Reddy, V. Ramu
    Rao, K. Sreenivasa
    COMPUTER SPEECH AND LANGUAGE, 2013, 27 (05): : 1105 - 1126
  • [27] EXPLORING END-TO-END NEURAL TEXT-TO-SPEECH SYNTHESIS FOR ROMANIAN
    Dumitrache, Marius
    Rebedea, Traian
    PROCEEDINGS OF THE 15TH INTERNATIONAL CONFERENCE LINGUISTIC RESOURCES AND TOOLS FOR NATURAL LANGUAGE PROCESSING, 2020, : 93 - 102
  • [28] PARAMETER GENERATION ALGORITHMS FOR TEXT-TO-SPEECH SYNTHESIS WITH RECURRENT NEURAL NETWORKS
    Klimkov, Viacheslav
    Moinet, Alexis
    Nadolski, Adam
    Drugman, Thomas
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 626 - 631
  • [29] Speech Enhancement for a Noise-Robust Text-to-Speech Synthesis System using Deep Recurrent Neural Networks
    Valentini-Botinhao, Cassia
    Wang, Xin
    Takaki, Shinji
    Yamagishi, Junichi
    17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 352 - 356
  • [30] Low-level articulatory synthesis: A working text-to-speech solution and a linguistic tool
    Hill, David R.
    Taube-Schock, Craig R.
    Manzara, Leonard
    CANADIAN JOURNAL OF LINGUISTICS-REVUE CANADIENNE DE LINGUISTIQUE, 2017, 62 (03): : 371 - 410