Model architectures to extrapolate emotional expressions in DNN-based text-to-speech

Cited by: 11
Authors
Inoue, Katsuki [1 ]
Hara, Sunao [1 ]
Abe, Masanobu [1 ]
Hojo, Nobukatsu [2 ]
Ijima, Yusuke [2 ]
Affiliations
[1] Okayama Univ, Grad Sch Interdisciplinary Sci & Engn Hlth Syst, Okayama, Japan
[2] NTT Corp, Tokyo, Japan
Keywords
Emotional speech synthesis; Extrapolation; DNN-based TTS; Text-to-speech; Acoustic model; Phoneme duration model; SPEAKER ADAPTATION; ALGORITHMS;
DOI: 10.1016/j.specom.2020.11.004
Chinese Library Classification (CLC): O42 [Acoustics]
Subject Classification Codes: 070206; 082403
Abstract
This paper proposes architectures that facilitate the extrapolation of emotional expressions in deep neural network (DNN)-based text-to-speech (TTS). In this study, "extrapolating emotional expressions" means borrowing emotional expressions from other speakers, so that no emotional speech needs to be collected from the target speakers. Although DNNs are powerful enough to construct DNN-based TTS with emotional expressions, and some DNN-based TTS systems have demonstrated satisfactory performance in expressing the diversity of human speech, collecting emotional speech uttered by target speakers is necessary and troublesome. To solve this issue, we propose architectures that train the speaker feature and the emotional feature separately and synthesize speech with any combination of speaker and emotion. The architectures are the parallel model (PM), serial model (SM), auxiliary input model (AIM), and two hybrid models (PM&AIM and SM&AIM). These models are trained on emotional speech uttered by a few speakers and neutral speech uttered by many speakers. Objective evaluations show that performance in the open-emotion test is insufficient compared with that in the closed-emotion test, presumably because each speaker has their own manner of expressing emotion. However, subjective evaluation results indicate that the proposed models can convey emotional information to some extent; notably, the PM correctly conveys sad and joyful emotions at rates above 60%.
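The core idea of the abstract — feed the acoustic network separate speaker and emotion codes so that unseen speaker/emotion combinations can be requested at synthesis time — can be illustrated with a minimal sketch. This is a hedged, hypothetical toy in the spirit of the auxiliary input model (AIM), not the authors' implementation: the network sizes, the one-hot codes, and the function `synthesize_frame` are all illustrative assumptions.

```python
import numpy as np

# Toy AIM-style acoustic model (illustrative, NOT the paper's network):
# linguistic features are concatenated with a one-hot speaker code and a
# one-hot emotion code, so any speaker/emotion pair can be requested at
# synthesis time, even pairs never observed together during training.

rng = np.random.default_rng(0)

N_SPEAKERS, N_EMOTIONS = 4, 3          # illustrative sizes
LING_DIM, HIDDEN, ACOUSTIC_DIM = 10, 16, 5

# Randomly initialised weights stand in for a trained model.
W1 = rng.standard_normal((LING_DIM + N_SPEAKERS + N_EMOTIONS, HIDDEN)) * 0.1
W2 = rng.standard_normal((HIDDEN, ACOUSTIC_DIM)) * 0.1

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def synthesize_frame(linguistic, speaker_id, emotion_id):
    """Predict one acoustic frame for a chosen speaker/emotion pair."""
    x = np.concatenate([linguistic,
                        one_hot(speaker_id, N_SPEAKERS),
                        one_hot(emotion_id, N_EMOTIONS)])
    h = np.tanh(x @ W1)                # shared hidden layer
    return h @ W2                      # acoustic parameters per frame

ling = rng.standard_normal(LING_DIM)
# Request target speaker 0 with an emotion learned only from other speakers:
frame = synthesize_frame(ling, speaker_id=0, emotion_id=2)
print(frame.shape)  # (5,)
```

Because the speaker and emotion codes enter as independent auxiliary inputs, extrapolation amounts to simply setting a code combination unseen in training; the paper's PM/SM variants instead separate the two factors structurally inside the network.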
Pages: 35-43 (9 pages)