CONVERSATIONAL END-TO-END TTS FOR VOICE AGENTS

Cited by: 30
Authors
Guo, Haohan [1 ,3 ]
Zhang, Shaofei [2 ]
Soong, Frank K. [2 ]
He, Lei [2 ]
Xie, Lei [1 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLP NPU, Xian, Peoples R China
[2] Microsoft China, Beijing, Peoples R China
[3] Microsoft, Redmond, WA USA
Keywords
Text-to-Speech; End-to-End; Conversational TTS; Speech Corpus; Voice Agent;
DOI
10.1109/SLT48900.2021.9383460
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
End-to-end neural TTS has achieved excellent performance on reading-style speech synthesis. However, it is still a challenge to build a high-quality conversational TTS due to the limitations of corpus and modeling capability. This study aims at building a conversational TTS for a voice agent under a sequence-to-sequence modeling framework. We first construct a spontaneous conversational speech corpus well designed for the voice agent, with a new recording scheme ensuring both recording quality and conversational speaking style. Second, we propose a conversation context-aware end-to-end TTS approach that employs an auxiliary encoder and a conversational context encoder to reinforce the information about the current utterance and about its context in the conversation. Experimental results show that the proposed approach produces more natural prosody in accordance with the conversational context, with significant preference gains at both utterance level and conversation level. Moreover, we find that the model has the ability to express some spontaneous behaviors like fillers and repeated words, which makes the conversational speaking style more realistic.
Pages: 403 - 409
Page count: 7
Related papers
50 in total
  • [41] END-TO-END LYRICS RECOGNITION WITH VOICE TO SINGING STYLE TRANSFER
    Basak, Sakya
    Agarwal, Shrutina
    Ganapathy, Sriram
    Takahashi, Naoya
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 266 - 270
  • [42] Improving Performance of Seen and Unseen Speech Style Transfer in End-to-end Neural TTS
    An, Xiaochun
    Soong, Frank K.
    Xie, Lei
    INTERSPEECH 2021, 2021, : 4688 - 4692
  • [43] SR-TTS: a rhyme-based end-to-end speech synthesis system
    Yao, Yihao
    Liang, Tao
    Feng, Rui
    Shi, Keke
    Yu, Junxiao
    Wang, Wei
    Li, Jianqing
    FRONTIERS IN NEUROROBOTICS, 2024, 18
  • [44] The MARA corpus: Expressivity in end-to-end TTS systems using synthesised speech data
    Stan, Adriana
    Lorincz, Beata
    Nutu, Maria
    Giurgiu, Mircea
    2021 INTERNATIONAL CONFERENCE ON SPEECH TECHNOLOGY AND HUMAN-COMPUTER DIALOGUE (SPED), 2021, : 85 - 90
  • [45] Naturalness Enhancement with Linguistic Information in End-to-End TTS Using Unsupervised Parallel Encoding
    Peiro-Lilja, Alex
    Farrus, Mireia
    INTERSPEECH 2020, 2020, : 3994 - 3998
  • [46] END-TO-END CODE-SWITCHING TTS WITH CROSS-LINGUAL LANGUAGE MODEL
    Zhou, Xuehao
    Tian, Xiaohai
    Lee, Grandee
    Das, Rohan Kumar
    Li, Haizhou
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7614 - 7618
  • [47] SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech
    Cho, Hyunjae
    Jung, Wonbin
    Lee, Junhyeok
    Woo, Sang Hoon
    INTERSPEECH 2022, 2022, : 1 - 5
  • [49] End-to-end conversational speech synthesis with controllable emotions in the dimensions of pleasantness and arousal
    Mori, Hiroki
    Nishino, Hironao
    ACOUSTICAL SCIENCE AND TECHNOLOGY, 2025, 46 (01) : 70 - 77
  • [50] IvCDS: An End-to-End Driver Simulator for Personal In-Vehicle Conversational Assistant
    Ji, Tianbo
    Yin, Xuanhua
    Cheng, Peng
    Zhou, Liting
    Liu, Siyou
    Bao, Wei
    Lyu, Chenyang
    INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH, 2022, 19 (23)