CONVERSATIONAL END-TO-END TTS FOR VOICE AGENTS

Cited by: 18
Authors
Guo, Haohan [1, 3]
Zhang, Shaofei [2 ]
Soong, Frank K. [2 ]
He, Lei [2 ]
Xie, Lei [1 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLP NPU, Xian, Peoples R China
[2] Microsoft China, Beijing, Peoples R China
[3] Microsoft, Redmond, WA USA
Keywords
Text-to-Speech; End-to-End; Conversational TTS; Speech Corpus; Voice Agent;
DOI
10.1109/SLT48900.2021.9383460
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
End-to-end neural TTS has achieved excellent performance on reading-style speech synthesis. However, building a high-quality conversational TTS remains a challenge because of limitations in both corpus and modeling capability. This study aims to build a conversational TTS for a voice agent under the sequence-to-sequence modeling framework. First, we construct a spontaneous conversational speech corpus designed for the voice agent, using a new recording scheme that ensures both recording quality and a conversational speaking style. Second, we propose a conversation context-aware end-to-end TTS approach that employs an auxiliary encoder and a conversational context encoder to reinforce information about the current utterance and its context in the conversation. Experimental results show that the proposed approach produces more natural prosody in accordance with the conversational context, with significant preference gains at both the utterance level and the conversation level. Moreover, we find that the model can express some spontaneous behaviors, such as fillers and repeated words, which makes the conversational speaking style more realistic.
Pages: 403-409
Number of pages: 7
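Illustrative note
The abstract describes conditioning synthesis on the conversational context but gives no architectural detail. The sketch below is only a toy illustration of the general idea of context conditioning, not the authors' model: the names encode_context and condition_on_context, the decayed averaging, and the concatenation step are all assumptions made for this example, standing in for the learned encoders the paper refers to.

```python
from typing import List

Vector = List[float]


def encode_context(history: List[Vector], decay: float = 0.5) -> Vector:
    """Summarize previous-utterance embeddings into one context vector.

    Hypothetical stand-in for a learned context encoder: an exponentially
    decayed weighted average, so that more recent turns count more.
    """
    if not history:
        raise ValueError("history must contain at least one utterance embedding")
    dim = len(history[0])
    # The last (most recent) utterance gets weight 1, the one before it
    # gets `decay`, the one before that `decay ** 2`, and so on.
    weights = [decay ** (len(history) - 1 - i) for i in range(len(history))]
    total = sum(weights)
    return [
        sum(w * utt[d] for w, utt in zip(weights, history)) / total
        for d in range(dim)
    ]


def condition_on_context(encoder_states: List[Vector], context: Vector) -> List[Vector]:
    """Append the same context vector to every encoder state of the current utterance."""
    return [state + context for state in encoder_states]


# Toy data (assumed): two earlier utterances with 2-dimensional embeddings,
# and a current utterance with two 3-dimensional encoder states.
history = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

context = encode_context(history)
conditioned = condition_on_context(encoder_states, context)
```

In this toy setup each conditioned state has five dimensions: the three original encoder dimensions followed by the two context dimensions. A decoder attending over such states would see the same conversation summary at every position, which is one simple way context can influence prosody.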