CONVERSATIONAL END-TO-END TTS FOR VOICE AGENTS

Cited by: 18
Authors
Guo, Haohan [1, 3]
Zhang, Shaofei [2]
Soong, Frank K. [2]
He, Lei [2]
Xie, Lei [1]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLP NPU, Xian, Peoples R China
[2] Microsoft China, Beijing, Peoples R China
[3] Microsoft, Redmond, WA USA
Keywords
Text-to-Speech; End-to-End; Conversational TTS; Speech Corpus; Voice Agent
DOI
10.1109/SLT48900.2021.9383460
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
End-to-end neural TTS has achieved excellent performance on reading-style speech synthesis. However, building a high-quality conversational TTS system remains a challenge because of limitations in both corpus and modeling capability. This study aims to build a conversational TTS system for a voice agent under a sequence-to-sequence modeling framework. First, we construct a spontaneous conversational speech corpus designed for the voice agent, using a new recording scheme that ensures both recording quality and a conversational speaking style. Second, we propose a conversation context-aware end-to-end TTS approach that employs an auxiliary encoder and a conversational context encoder to reinforce information about the current utterance and its context in the conversation. Experimental results show that the proposed approach produces more natural prosody in accordance with the conversational context, with significant preference gains at both the utterance level and the conversation level. Moreover, we find that the model can express some spontaneous behaviors, such as fillers and repeated words, which makes the conversational speaking style more realistic.
Pages: 403-409
Page count: 7
Related papers
50 in total
  • [21] Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?
    Cooper, Erica
    Lai, Cheng-I
    Yasuda, Yusuke
    Yamagishi, Junichi
    [J]. INTERSPEECH 2020, 2020, : 3979 - 3983
  • [22] Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control
    Giridhar Pamisetty
    K. Sri Rama Murty
    [J]. Circuits, Systems, and Signal Processing, 2023, 42 : 361 - 384
  • [23] Text Enhancement for Paragraph Processing in End-to-End Code-switching TTS
    Qiang, Chunyu
    Tao, Jianhua
    Fu, Ruibo
    Wen, Zhengqi
    Yi, Jiangyan
    Wang, Tao
    Wang, Shiming
    [J]. 2021 12TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2021,
  • [24] An End-to-End TTS Model in Chhattisgarhi, a Low-Resource Indian Language
    Singh, Abhayjeet
    Jayakumar, Anjali
    Deekshitha, G.
    Kumar, Hitesh
    Bandekar, Jesuraja
    Badiger, Sandhya
    Udupa, Sathvik
    Kumar, Saurabh
    Ghosh, Prasanta Kumar
    [J]. SPEECH AND COMPUTER, SPECOM 2023, PT II, 2023, 14339 : 164 - 172
  • [25] TriniTTS: Pitch-controllable End-to-end TTS without External Aligner
    Ju, Yoon-Cheol
    Kim, Il-Hwan
    Yang, Hong-Sun
    Kim, Ji-Hoon
    Kim, Byeong-Yeol
    Maiti, Soumi
    Watanabe, Shinji
    [J]. INTERSPEECH 2022, 2022, : 16 - 20
  • [26] Non-autoregressive End-to-End TTS with Coarse-to-Fine Decoding
    Wang, Tao
    Liu, Xuefei
    Tao, Jianhua
    Yi, Jiangyan
    Fu, Ruibo
    Wen, Zhengqi
    [J]. INTERSPEECH 2020, 2020, : 3984 - 3988
  • [27] Conversational recommendation based on end-to-end learning: How far are we?
    Manzoor, Ahtsham
    Jannach, Dietmar
    [J]. COMPUTERS IN HUMAN BEHAVIOR REPORTS, 2021, 4
  • [28] E3TTS: End-to-End Text-Based Speech Editing TTS System and Its Applications
    Liang, Zheng
    Ma, Ziyang
    Du, Chenpeng
    Yu, Kai
    Chen, Xie
    [J]. IEEE/ACM Transactions on Audio Speech and Language Processing, 2024, 32 : 4810 - 4821
  • [29] End-to-End Spoken Language Understanding for Generalized Voice Assistants
    Saxon, Michael
    Choudhary, Samridhi
    McKenna, Joseph P.
    Mouchtaris, Athanasios
    [J]. INTERSPEECH 2021, 2021, : 4738 - 4742
  • [30] End-to-end network performance measurement for voice transmission feasibility
    Kampichler, W
    Goeschka, KM
    [J]. COMPUTERS AND THEIR APPLICATIONS, 2001, : 501 - 504