Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody

Cited by: 1
Authors
Makarov, Peter [1]
Abbas, Ammar [1]
Lajszczak, Mateusz [1]
Joly, Arnaud [1]
Karlapati, Sri [1]
Moinet, Alexis [1]
Drugman, Thomas [1]
Karanasou, Penny [1]
Affiliations
[1] Amazon, Alexa AI, Cambridge, England
Source
Interspeech 2022
Keywords
neural text-to-speech; long-form TTS; multi-speaker TTS; contextual word embeddings; FastSpeech; BERT;
DOI
10.21437/Interspeech.2022-379
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Generating expressive and contextually appropriate prosody remains a challenge for modern text-to-speech (TTS) systems. This is particularly evident for long, multi-sentence inputs. In this paper, we examine simple extensions to a Transformer-based FastSpeech-like system, with the goal of improving prosody for multi-sentence TTS. We find that long context, powerful text features, and training on multi-speaker data all improve prosody. More interestingly, they result in synergies. Long context disambiguates prosody, improves coherence, and plays to the strengths of Transformers. Finetuning word-level features from a powerful language model, such as BERT, appears to benefit from more training data, readily available in a multi-speaker setting. We look into objective metrics on pausing and pacing and perform thorough subjective evaluations for speech naturalness. Our main system, which incorporates all the extensions, achieves consistently strong results, including statistically significant improvements in speech naturalness over all its competitors.
Pages: 3368-3372
Page count: 5