CONVERSATIONAL END-TO-END TTS FOR VOICE AGENTS

Cited by: 30
Authors
Guo, Haohan [1, 3]
Zhang, Shaofei [2 ]
Soong, Frank K. [2 ]
He, Lei [2 ]
Xie, Lei [1 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLP NPU, Xian, Peoples R China
[2] Microsoft China, Beijing, Peoples R China
[3] Microsoft, Redmond, WA USA
Keywords
Text-to-Speech; End-to-End; Conversational TTS; Speech Corpus; Voice Agent
DOI
10.1109/SLT48900.2021.9383460
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
End-to-end neural TTS has achieved excellent performance on reading-style speech synthesis. However, building a high-quality conversational TTS remains a challenge because of limitations in both corpus and modeling capability. This study aims to build a conversational TTS for a voice agent under a sequence-to-sequence modeling framework. First, we construct a spontaneous conversational speech corpus designed for the voice agent, using a new recording scheme that ensures both recording quality and a conversational speaking style. Second, we propose a conversation context-aware end-to-end TTS approach that employs an auxiliary encoder and a conversational context encoder to reinforce information about the current utterance and about its context in the conversation. Experimental results show that the proposed approach produces more natural prosody in accordance with the conversational context, with significant preference gains at both the utterance level and the conversation level. Moreover, we find that the model can express some spontaneous behaviors, such as fillers and repeated words, which makes the conversational speaking style more realistic.
Pages: 403-409
Number of pages: 7