FINE-GRAINED EMOTION STRENGTH TRANSFER, CONTROL AND PREDICTION FOR EMOTIONAL SPEECH SYNTHESIS

Cited by: 32
Authors
Lei, Yi [1]
Yang, Shan [1]
Xie, Lei [1]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLP NPU, Xian, Peoples R China
Keywords
text-to-speech; expressive speech synthesis; emotion strength; sequence-to-sequence
DOI
10.1109/SLT48900.2021.9383524
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper proposes a unified model to conduct emotion transfer, control and prediction for sequence-to-sequence based fine-grained emotional speech synthesis. Conventional emotional speech synthesis often needs manual labels or reference audio to determine the emotional expressions of synthesized speech. Such coarse labels cannot control the details of speech emotion, often resulting in an averaged emotion expression delivery, and it is also hard to choose suitable reference audio during inference. To conduct fine-grained emotion expression generation, we introduce phoneme-level emotion strength representations through a learned ranking function to describe the local emotion details, and the sentence-level emotion category is adopted to render the global emotions of synthesized speech. With the global render and local descriptors of emotions, we can obtain fine-grained emotion expressions from reference audio via its emotion descriptors (for transfer) or directly from phoneme-level manual labels (for control). As for the emotional speech synthesis with arbitrary text inputs, the proposed model can also predict phoneme-level emotion expressions from texts, which does not require any reference audio or manual label.
Pages: 423-430
Page count: 8
Related papers (first 10 of 50 shown)
  • [1] Improving Fine-Grained Emotion Control and Transfer with Gated Emotion Representations in Speech Synthesis
    Ye, Jianhao
    He, Tianwei
    Zhou, Hongbin
    Ren, Kaimeng
    He, Wendi
    Lu, Heng
    MAN-MACHINE SPEECH COMMUNICATION, NCMMSC 2022, 2023, 1765: 196-207
  • [2] Fine-grained Noise Control for Multispeaker Speech Synthesis
    Nikitaras, Karolos
    Vamvoukakis, Georgios
    Ellinas, Nikolaos
    Klapsas, Konstantinos
    Markopoulos, Konstantinos
    Raptis, Spyros
    Sung, June Sig
    Jho, Gunu
    Chalamandaris, Aimilios
    Tsiakoulis, Pirros
    INTERSPEECH 2022, 2022: 828-832
  • [3] Fine-Grained Emotion Prediction by Modeling Emotion Definitions
    Singh, Gargi
    Brahma, Dhanajit
    Rai, Piyush
    Modi, Ashutosh
    2021 9TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION (ACII), 2021
  • [4] EMOTION NEURAL TRANSDUCER FOR FINE-GRAINED SPEECH EMOTION RECOGNITION
    Shen, Siyuan
    Gao, Yu
    Liu, Feng
    Wang, Hanyang
    Zhou, Aimin
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024), 2024: 10111-10115
  • [5] MULTI-SPEAKER EMOTIONAL SPEECH SYNTHESIS WITH FINE-GRAINED PROSODY MODELING
    Lu, Chunhui
    Wen, Xue
    Liu, Ruolan
    Chen, Xiao
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021: 5729-5733
  • [6] Text-Based Fine-Grained Emotion Prediction
    Singh, Gargi
    Brahma, Dhanajit
    Rai, Piyush
    Modi, Ashutosh
    IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2024, 15 (02): 405-416
  • [7] MsEmoTTS: Multi-Scale Emotion Transfer, Prediction, and Control for Emotional Speech Synthesis
    Lei, Yi
    Yang, Shan
    Wang, Xinsheng
    Xie, Lei
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30: 853-864
  • [8] EMOTION-CONTROLLABLE SPEECH SYNTHESIS USING EMOTION SOFT LABELS AND FINE-GRAINED PROSODY FACTORS
    Luo, Xuan
    Takamichi, Shinnosuke
    Koriyama, Tomoki
    Saito, Yuki
    Saruwatari, Hiroshi
    2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021: 794-799
  • [9] EMOQ-TTS: EMOTION INTENSITY QUANTIZATION FOR FINE-GRAINED CONTROLLABLE EMOTIONAL TEXT-TO-SPEECH
    Im, Chae-Bin
    Lee, Sang-Hoon
    Kim, Seung-Bin
    Lee, Seong-Whan
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022: 6317-6321
  • [10] PiCo-VITS: Leveraging Pitch Contours for Fine-Grained Emotional Speech Synthesis
    Wong, Kwan-yeung
    Chung, Fu-lai
    TEXT, SPEECH, AND DIALOGUE, TSD 2024, PT II, 2024, 15049: 210-221