Investigation of Using Continuous Representation of Various Linguistic Units in Neural Network Based Text-to-Speech Synthesis

Cited: 3
Authors
Wang, Xin [1 ,2 ]
Takaki, Shinji [1 ]
Yamagishi, Junichi [1 ,2 ,3 ]
Affiliations
[1] Natl Inst Informat, Tokyo 1018430, Japan
[2] SOKENDAI, Tokyo 1018430, Japan
[3] Univ Edinburgh, CSTR, Edinburgh EH8 9LW, Midlothian, Scotland
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
text-to-speech; speech synthesis; recurrent neural network; contexts; word embedding;
DOI
10.1587/transinf.2016SLP0011
CLC classification
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
Building high-quality text-to-speech (TTS) systems without expert knowledge of the target language and/or time-consuming manual annotation of speech and text data is an important yet challenging research topic. In this kind of TTS system, it is vital to find a representation of the input text that is both effective and easy to acquire. Recently, the continuous representation of raw word inputs, called "word embedding", has been used successfully in various natural language processing tasks. It has also been used as an additional or alternative linguistic input feature for neural-network-based acoustic models in TTS systems. In this paper, we further investigate the use of this embedding technique to represent phonemes, syllables and phrases for acoustic models based on recurrent and feed-forward neural networks. The experimental results show that most of these continuous representations cannot significantly improve the system's performance when they are fed into the acoustic model either as an additional component or as a replacement for the conventional prosodic context. However, subjective evaluation shows that the continuous representation of phrases achieves a significant improvement when it is combined with the prosodic context as input to a feed-forward-network-based acoustic model.
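The best-performing configuration in the abstract, a continuous phrase representation combined with the conventional prosodic context as input to a feed-forward acoustic model, can be sketched roughly as follows. All dimensions, the one-hidden-layer network, and the random weights are illustrative assumptions for the sketch, not the paper's actual architecture or training setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
EMBED_DIM = 32      # continuous phrase representation ("phrase embedding")
PROSODY_DIM = 16    # conventional prosodic-context features
ACOUSTIC_DIM = 60   # e.g. spectral coefficients predicted per frame
HIDDEN_DIM = 64

def make_ffnn(in_dim, hidden_dim, out_dim, rng):
    """Random weights for a one-hidden-layer feed-forward net (untrained sketch)."""
    return {
        "W1": rng.normal(scale=0.1, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(scale=0.1, size=(hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def forward(net, x):
    # tanh hidden layer, linear output layer
    h = np.tanh(x @ net["W1"] + net["b1"])
    return h @ net["W2"] + net["b2"]

# One input frame: prosodic context concatenated with the phrase embedding,
# i.e. the embedding augments rather than replaces the prosodic features.
prosody = rng.normal(size=PROSODY_DIM)
phrase_embedding = rng.normal(size=EMBED_DIM)
x = np.concatenate([prosody, phrase_embedding])

net = make_ffnn(PROSODY_DIM + EMBED_DIM, HIDDEN_DIM, ACOUSTIC_DIM, rng)
acoustic_frame = forward(net, x)
print(acoustic_frame.shape)  # (60,)
```

The key design point mirrored here is the input layout: concatenating the learned continuous representation with the hand-crafted prosodic context, rather than substituting one for the other, is the combination the paper reports as yielding a significant subjective improvement.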
Pages: 2471 - 2480
Page count: 10
Related papers
50 in total
  • [1] Investigations on speaker adaptation using a continuous vocoder within recurrent neural network based text-to-speech synthesis
    Mandeel, Ali Raheem
    Al-Radhi, Mohammed Salah
    Csapó, Tamás Gábor
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (10) : 15635 - 15649
  • [3] Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis
    Yasuda, Yusuke
    Wang, Xin
    Yamagishi, Junichi
    COMPUTER SPEECH AND LANGUAGE, 2021, 67
  • [4] Development of Assamese Text-to-speech System using Deep Neural Network
    Deka, Abhash
    Sarmah, Priyankoo
    Samudravijaya, K.
    Prasanna, S. R. M.
    2019 25TH NATIONAL CONFERENCE ON COMMUNICATIONS (NCC), 2019
  • [5] Articulatory Text-to-Speech Synthesis using the Digital Waveguide Mesh driven by a Deep Neural Network
    Gully, Amelia J.
    Yoshimura, Takenori
    Murphy, Damian T.
    Hashimoto, Kei
    Nankaku, Yoshihiko
    Tokuda, Keiichi
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 234 - 238
  • [6] Prosody modeling for syllable based text-to-speech synthesis using feedforward neural networks
    Reddy, V. Ramu
    Rao, K. Sreenivasa
    NEUROCOMPUTING, 2016, 171 : 1323 - 1334
  • [7] PROSODIC REPRESENTATION LEARNING AND CONTEXTUAL SAMPLING FOR NEURAL TEXT-TO-SPEECH
    Karlapati, Sri
    Abbas, Ammar
    Hodari, Zack
    Moinet, Alexis
    Joly, Arnaud
    Karanasou, Penny
    Drugman, Thomas
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6573 - 6577
  • [8] Controllable neural text-to-speech synthesis using intuitive prosodic features
    Raitio, Tuomo
    Rasipuram, Ramya
    Castellani, Dan
    INTERSPEECH 2020, 2020, : 4432 - 4436
  • [9] Automatic generation of synthesis units for trainable text-to-speech systems
    Hon, H
    Acero, A
    Huang, X
    Liu, J
    Plumpe, M
    PROCEEDINGS OF THE 1998 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-6, 1998, : 293 - 296
  • [10] A STUDY ON NEURAL-NETWORK-BASED TEXT-TO-SPEECH ADAPTATION TECHNIQUES FOR VIETNAMESE
    Pham Ngoc Phuong
    Chung Tran Quang
    Quoc Truong Do
    Mai Chi Luong
    2021 24TH CONFERENCE OF THE ORIENTAL COCOSDA INTERNATIONAL COMMITTEE FOR THE CO-ORDINATION AND STANDARDISATION OF SPEECH DATABASES AND ASSESSMENT TECHNIQUES (O-COCOSDA), 2021, : 199 - 205