CONVERSATIONAL END-TO-END TTS FOR VOICE AGENTS

Cited by: 30
Authors
Guo, Haohan [1, 3]
Zhang, Shaofei [2]
Soong, Frank K. [2]
He, Lei [2]
Xie, Lei [1]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLP NPU, Xian, Peoples R China
[2] Microsoft China, Beijing, Peoples R China
[3] Microsoft, Redmond, WA USA
Keywords
Text-to-Speech; End-to-End; Conversational TTS; Speech Corpus; Voice Agent
DOI
10.1109/SLT48900.2021.9383460
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
End-to-end neural TTS has achieved excellent performance on reading-style speech synthesis. However, building a high-quality conversational TTS remains a challenge due to limitations of corpus and modeling capability. This study aims to build a conversational TTS for a voice agent under the sequence-to-sequence modeling framework. First, we construct a spontaneous conversational speech corpus designed for the voice agent, using a new recording scheme that ensures both recording quality and conversational speaking style. Second, we propose a conversation context-aware end-to-end TTS approach that employs an auxiliary encoder and a conversational context encoder to reinforce the information about the current utterance and about its context in the conversation. Experimental results show that the proposed approach produces more natural prosody in accordance with the conversational context, with significant preference gains at both the utterance level and the conversation level. Moreover, we find that the model can express some spontaneous behaviors, such as fillers and repeated words, which makes the conversational speaking style more realistic.
Pages: 403-409
Page count: 7