Model architectures to extrapolate emotional expressions in DNN-based text-to-speech

Cited by: 11
Authors
Inoue, Katsuki [1]
Hara, Sunao [1]
Abe, Masanobu [1]
Hojo, Nobukatsu [2]
Ijima, Yusuke [2]
Affiliations
[1] Okayama Univ, Grad Sch Interdisciplinary Sci & Engn Hlth Syst, Okayama, Japan
[2] NTT Corp, Tokyo, Japan
Keywords
Emotional speech synthesis; Extrapolation; DNN-based TTS; Text-to-speech; Acoustic model; Phoneme duration model; SPEAKER ADAPTATION; ALGORITHMS;
DOI
10.1016/j.specom.2020.11.004
Chinese Library Classification (CLC): O42 [Acoustics]
Discipline codes: 070206; 082403
Abstract
This paper proposes architectures that facilitate the extrapolation of emotional expressions in deep neural network (DNN)-based text-to-speech (TTS). Here, "extrapolating emotional expressions" means borrowing emotional expressions from other speakers, so that no emotional speech needs to be collected from the target speakers. Although DNNs are powerful enough to build TTS with emotional expressions, and some DNN-based TTS systems have satisfactorily expressed the diversity of human speech, collecting emotional speech from every target speaker is troublesome. To solve this issue, we propose architectures that train the speaker feature and the emotional feature separately and then synthesize speech with any combination of speaker and emotion. The architectures are the parallel model (PM), the serial model (SM), the auxiliary input model (AIM), and two hybrid models (PM&AIM and SM&AIM). They are trained on emotional speech uttered by a few speakers together with neutral speech uttered by many speakers. Objective evaluations show that performance in the open-emotion test falls short of that in the closed-emotion test, partly because each speaker has their own manner of expressing emotion. Subjective evaluation results, however, indicate that the proposed models convey emotional information to some extent; notably, the PM conveys sad and joyful emotions correctly at a rate above 60%.
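The abstract names the architectures but not their internals. As a reading aid, below is a minimal, assumption-laden PyTorch sketch of the auxiliary-input idea behind the AIM: an acoustic model that receives frame-level linguistic features together with learned speaker and emotion codes. Because the two codes are independent inputs, a speaker seen only with neutral training data can be paired at synthesis time with an emotion code learned from other speakers' emotional speech, which is the extrapolation the paper targets. All class names, layer sizes, dimensions, and the three-emotion inventory are illustrative assumptions, not the paper's configuration.

    # Minimal sketch (assumed configuration, not the paper's): an acoustic
    # model conditioned on speaker and emotion codes as auxiliary inputs.
    import torch
    import torch.nn as nn

    class AuxiliaryInputAcousticModel(nn.Module):
        def __init__(self, n_linguistic=300, n_speakers=100, n_emotions=3,
                     spk_dim=16, emo_dim=8, hidden=512, n_acoustic=187):
            super().__init__()
            self.spk_emb = nn.Embedding(n_speakers, spk_dim)  # speaker code
            self.emo_emb = nn.Embedding(n_emotions, emo_dim)  # emotion code
            self.net = nn.Sequential(
                nn.Linear(n_linguistic + spk_dim + emo_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, n_acoustic),  # e.g. spectral + F0 features
            )

        def forward(self, linguistic, speaker_id, emotion_id):
            # Broadcast the utterance-level codes over all frames.
            frames = linguistic.size(1)
            spk = self.spk_emb(speaker_id).unsqueeze(1).expand(-1, frames, -1)
            emo = self.emo_emb(emotion_id).unsqueeze(1).expand(-1, frames, -1)
            return self.net(torch.cat([linguistic, spk, emo], dim=-1))

    # Extrapolation at synthesis time: pair a speaker seen only with neutral
    # training data with an emotion code learned from other speakers.
    model = AuxiliaryInputAcousticModel()
    x = torch.randn(1, 200, 300)  # (batch, frames, linguistic features)
    y = model(x, torch.tensor([42]), torch.tensor([2]))  # speaker 42, "joy"

The PM, SM, and hybrid variants presumably differ in where the speaker- and emotion-dependent parts sit within the network; the common point, per the abstract, is that the two factors are trained separately and recombined freely.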
Pages: 35-43
Page count: 9