Model architectures to extrapolate emotional expressions in DNN-based text-to-speech

Cited by: 11
Authors
Inoue, Katsuki [1 ]
Hara, Sunao [1 ]
Abe, Masanobu [1 ]
Hojo, Nobukatsu [2 ]
Ijima, Yusuke [2 ]
Affiliations
[1] Okayama Univ, Grad Sch Interdisciplinary Sci & Engn Hlth Syst, Okayama, Japan
[2] NTT Corp, Tokyo, Japan
Keywords
Emotional speech synthesis; Extrapolation; DNN-based TTS; Text-to-speech; Acoustic model; Phoneme duration model; Speaker adaptation; Algorithms
DOI
10.1016/j.specom.2020.11.004
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
This paper proposes architectures that facilitate the extrapolation of emotional expressions in deep neural network (DNN)-based text-to-speech (TTS). In this study, "extrapolating emotional expressions" means borrowing emotional expressions from other speakers, so that collecting emotional speech uttered by the target speakers becomes unnecessary. Although DNNs are powerful enough to build TTS systems with emotional expressions, and some DNN-based TTS systems have satisfactorily expressed the diversity of human speech, collecting emotional speech from each target speaker is necessary and troublesome. To solve this issue, we propose architectures that train speaker features and emotional features separately and synthesize speech with any combination of speaker and emotion. The architectures are the parallel model (PM), the serial model (SM), the auxiliary input model (AIM), and two hybrid models (PM&AIM and SM&AIM). These models are trained on emotional speech uttered by a few speakers and neutral speech uttered by many speakers. Objective evaluations show that scores in the open-emotion test provide insufficient information when compared with those in the closed-emotion test, because each speaker has their own manner of expressing emotion. However, subjective evaluation results indicate that the proposed models can convey emotional information to some extent; notably, the PM conveys sad and joyful emotions correctly at rates above 60%.
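The architectures named in the abstract share one mechanism: speaker identity and emotion identity are modeled as separate features that a single acoustic model combines, so an unseen (speaker, emotion) pair can be requested at synthesis time. The PyTorch sketch below illustrates an auxiliary-input-style variant of that idea. It is a minimal sketch under our own assumptions: the class name AuxInputAcousticModel, all layer sizes, and the feature dimensions are illustrative and are not the paper's actual configuration.

```python
# Minimal sketch of an auxiliary-input acoustic model: speaker and emotion
# identities enter as learned embedding codes alongside the frame-level
# linguistic features. All dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

class AuxInputAcousticModel(nn.Module):
    def __init__(self, n_linguistic=300, n_speakers=50, n_emotions=4,
                 spk_dim=16, emo_dim=8, hidden=256, n_acoustic=187):
        super().__init__()
        # Speaker and emotion features are trained separately as embeddings
        # and concatenated with the linguistic features at every frame.
        self.spk_emb = nn.Embedding(n_speakers, spk_dim)
        self.emo_emb = nn.Embedding(n_emotions, emo_dim)
        self.net = nn.Sequential(
            nn.Linear(n_linguistic + spk_dim + emo_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_acoustic),  # e.g. spectral + F0 parameters
        )

    def forward(self, linguistic, speaker_id, emotion_id):
        # linguistic: (frames, n_linguistic); ids: 0-dim long tensors
        frames = linguistic.size(0)
        spk = self.spk_emb(speaker_id).expand(frames, -1)
        emo = self.emo_emb(emotion_id).expand(frames, -1)
        return self.net(torch.cat([linguistic, spk, emo], dim=-1))

model = AuxInputAcousticModel()
x = torch.randn(100, 300)  # 100 frames of linguistic features
# Training covers (many speakers x neutral) plus (few speakers x emotions);
# extrapolation queries a combination never seen together in training:
y = model(x, torch.tensor(42), torch.tensor(2))  # speaker 42 + emotion 2
```

Because the speaker and emotion codes are independent inputs, the trained network can be queried for a (speaker, emotion) pair that never co-occurred in the training data, which is the extrapolation setting the paper evaluates.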
Pages: 35-43 (9 pages)
Related Papers (50 items total)
  • [11] Pre-Training of DNN-Based Speech Synthesis Based on Bidirectional Conversion between Text and Speech
    Sone, Kentaro
    Nakashika, Toru
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2019, E102D (08) : 1546 - 1553
  • [12] On the Training of DNN-based Average Voice Model for Speech Synthesis
    Yang, Shan
    Wu, Zhizheng
    Xie, Lei
    2016 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA), 2016,
  • [13] EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model
    Cui, Chenye
    Ren, Yi
    Liu, Jinglin
    Chen, Feiyang
    Huang, Rongjie
    Lei, Ming
    Zhao, Zhou
    INTERSPEECH 2021, 2021, : 2766 - 2770
  • [14] DNN-BASED SPEECH RECOGNITION FOR GLOBALPHONE LANGUAGES
    Tachbelie, Martha Yifiru
    Abulimiti, Ayimunishagu
    Abate, Solomon Teferra
    Schultz, Tanja
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 8269 - 8273
  • [15] DNN-BASED ENHANCEMENT OF NOISY AND REVERBERANT SPEECH
    Zhao, Yan
    Wang, DeLiang
    Merks, Ivo
    Zhang, Tao
    2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 6525 - 6529
  • [16] DNN-based Speech Synthesis for Small Data Sets Considering Bidirectional Speech-Text Conversion
    Sone, Kentaro
    Nakashika, Toru
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 2519 - 2523
  • [17] Gemination prediction using DNN for Arabic text-to-speech synthesis
    Ali, Ikbel Hadj
    Mnasri, Zied
    Lachiri, Zied
    2019 16TH INTERNATIONAL MULTI-CONFERENCE ON SYSTEMS, SIGNALS & DEVICES (SSD), 2019, : 366 - 370
  • [18] DNN-based automatic speech recognition as a model for human phoneme perception
    Exter, Mats
    Meyer, Bernd T.
    17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 615 - 619
  • [19] Modeling and synthesizing emotional speech for Catalan text-to-speech synthesis
    Iriondo, I
    Alías, F
    Melenchón, J
    Llorca, MA
    AFFECTIVE DIALOGUE SYSTEMS, PROCEEDINGS, 2004, 3068 : 197 - 208
  • [20] Emo-TTS: Parallel Transformer-based Text-to-Speech Model with Emotional Awareness
    Osman, Mohamed
    5TH INTERNATIONAL CONFERENCE ON COMPUTING AND INFORMATICS (ICCI 2022), 2022, : 169 - 174