Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody

Cited by: 1
Authors:
Makarov, Peter [1]
Abbas, Ammar [1]
Lajszczak, Mateusz [1]
Joly, Arnaud [1]
Karlapati, Sri [1]
Moinet, Alexis [1]
Drugman, Thomas [1]
Karanasou, Penny [1]
Affiliations:
[1] Amazon, Alexa AI, Cambridge, England
Source: Interspeech 2022
Keywords: neural text-to-speech; long-form TTS; multi-speaker TTS; contextual word embeddings; FastSpeech; BERT
DOI: 10.21437/Interspeech.2022-379
CLC number: O42 [Acoustics]
Subject classification codes: 070206; 082403
Abstract:
Generating expressive and contextually appropriate prosody remains a challenge for modern text-to-speech (TTS) systems. This is particularly evident for long, multi-sentence inputs. In this paper, we examine simple extensions to a Transformer-based FastSpeech-like system, with the goal of improving prosody for multi-sentence TTS. We find that long context, powerful text features, and training on multi-speaker data all improve prosody. More interestingly, they result in synergies. Long context disambiguates prosody, improves coherence, and plays to the strengths of Transformers. Finetuning word-level features from a powerful language model, such as BERT, appears to benefit from more training data, readily available in a multi-speaker setting. We look into objective metrics on pausing and pacing and perform thorough subjective evaluations for speech naturalness. Our main system, which incorporates all the extensions, achieves consistently strong results, including statistically significant improvements in speech naturalness over all its competitors.
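The abstract mentions finetuning word-level features from a language model such as BERT. Since BERT operates on subword tokens, word-level features are typically obtained by pooling the subword hidden states belonging to each word. The sketch below illustrates that pooling step on toy tensors; the function name, dimensions, and example data are illustrative assumptions, not the paper's implementation.

```python
import torch

def pool_subwords_to_words(hidden, word_ids):
    """Mean-pool subword hidden states (n_subwords, d) into word-level
    vectors (n_words, d). word_ids maps each subword position to its
    word index (None for special tokens, which are skipped)."""
    n_words = max(i for i in word_ids if i is not None) + 1
    d = hidden.size(1)
    out = torch.zeros(n_words, d)
    counts = torch.zeros(n_words, 1)
    for j, w in enumerate(word_ids):
        if w is not None:
            out[w] += hidden[j]
            counts[w] += 1
    return out / counts  # broadcast divide: per-word mean

# Toy example: 5 subwords spanning 3 words
# (e.g. "play", "##ing", "the", "gui", "##tar")
hidden = torch.arange(10.0).reshape(5, 2)
word_ids = [0, 0, 1, 2, 2]
word_vecs = pool_subwords_to_words(hidden, word_ids)
# word 0 averages rows 0-1; word 1 is row 2; word 2 averages rows 3-4
```

In practice the `word_ids` mapping comes from the tokenizer (Hugging Face fast tokenizers expose it directly), and the pooled word vectors would then condition a FastSpeech-like acoustic model.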
Pages: 3368-3372
Page count: 5