CONVERSATIONAL END-TO-END TTS FOR VOICE AGENTS

Cited by: 30
Authors
Guo, Haohan [1, 3]
Zhang, Shaofei [2 ]
Soong, Frank K. [2 ]
He, Lei [2 ]
Xie, Lei [1 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLP NPU, Xian, Peoples R China
[2] Microsoft China, Beijing, Peoples R China
[3] Microsoft, Redmond, WA USA
Keywords
Text-to-Speech; End-to-End; Conversational TTS; Speech Corpus; Voice Agent;
DOI
10.1109/SLT48900.2021.9383460
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
End-to-end neural TTS has achieved excellent performance on reading-style speech synthesis. However, building a high-quality conversational TTS system remains a challenge because of limitations in corpus and modeling capability. This study aims to build a conversational TTS system for a voice agent under a sequence-to-sequence modeling framework. First, we construct a spontaneous conversational speech corpus designed for the voice agent, using a new recording scheme that ensures both recording quality and a conversational speaking style. Second, we propose a conversation context-aware end-to-end TTS approach that employs an auxiliary encoder and a conversational context encoder to reinforce the information about the current utterance and about its context in the conversation. Experimental results show that the proposed approach produces more natural prosody in accordance with the conversational context, with significant preference gains at both the utterance level and the conversation level. Moreover, we find that the model can express some spontaneous behaviors, such as fillers and repeated words, which makes the conversational speaking style more realistic.
Pages: 403 - 409
Page count: 7
Related papers
50 in total
  • [21] Voice End-to-End Encrypted for TETRA Radiocommunication System
    Buric, Marian
    PROCEEDINGS OF THE 2010 8TH INTERNATIONAL CONFERENCE ON COMMUNICATIONS (COMM), 2010, : 419 - 422
  • [22] Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control
    Pamisetty, Giridhar
    Murty, K. Sri Rama
    CIRCUITS SYSTEMS AND SIGNAL PROCESSING, 2023, 42 (01) : 361 - 384
  • [23] Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?
    Cooper, Erica
    Lai, Cheng-I
    Yasuda, Yusuke
    Yamagishi, Junichi
    INTERSPEECH 2020, 2020, : 3979 - 3983
  • [24] Prosody-TTS: An End-to-End Speech Synthesis System with Prosody Control
    Giridhar Pamisetty
    K. Sri Rama Murty
    Circuits, Systems, and Signal Processing, 2023, 42 : 361 - 384
  • [25] An End-to-End TTS Model in Chhattisgarhi, a Low-Resource Indian Language
    Singh, Abhayjeet
    Jayakumar, Anjali
    Deekshitha, G.
    Kumar, Hitesh
    Bandekar, Jesuraja
    Badiger, Sandhya
    Udupa, Sathvik
    Kumar, Saurabh
    Ghosh, Prasanta Kumar
    SPEECH AND COMPUTER, SPECOM 2023, PT II, 2023, 14339 : 164 - 172
  • [26] Text Enhancement for Paragraph Processing in End-to-End Code-switching TTS
    Qiang, Chunyu
    Tao, Jianhua
    Fu, Ruibo
    Wen, Zhengqi
    Yi, Jiangyan
    Wang, Tao
    Wang, Shiming
    2021 12TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2021,
  • [27] TriniTTS: Pitch-controllable End-to-end TTS without External Aligner
    Ju, Yoon-Cheol
    Kim, Il-Hwan
    Yang, Hong-Sun
    Kim, Ji-Hoon
    Kim, Byeong-Yeol
    Maiti, Soumi
    Watanabe, Shinji
    INTERSPEECH 2022, 2022, : 16 - 20
  • [28] Non-autoregressive End-to-End TTS with Coarse-to-Fine Decoding
    Wang, Tao
    Liu, Xuefei
    Tao, Jianhua
    Yi, Jiangyan
    Fu, Ruibo
    Wen, Zhengqi
    INTERSPEECH 2020, 2020, : 3984 - 3988
  • [29] Improving End-to-End Neural Diarization Using Conversational Summary Representations
    Broughton, Samuel J.
    Samarakoon, Lahiru
    INTERSPEECH 2023, 2023, : 3157 - 3161
  • [30] Conversational recommendation based on end-to-end learning: How far are we?
    Manzoor, Ahtsham
    Jannach, Dietmar
    COMPUTERS IN HUMAN BEHAVIOR REPORTS, 2021, 4