Enhancing Sequence-to-Sequence Text-to-Speech with Morphology

Cited by: 3
Authors
Taylor, Jason [1]
Richmond, Korin [1]
Affiliations
[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland
Keywords
Speech Synthesis; Sequence-to-Sequence; Morphology; Pronunciation;
DOI
10.21437/Interspeech.2020-1547
CLC Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline Codes
100104 ; 100213 ;
Abstract
Neural sequence-to-sequence (S2S) modelling encodes a single, unified representation for each input sequence. When used for text-to-speech synthesis (TTS), such representations must embed ambiguities between English spelling and pronunciation. For example, the character sequence th sounds different in "pothole" and "there". This can be problematic when predicting pronunciation directly from letters. We posit that pronunciation becomes easier to predict when letters are grouped into subword units like morphemes (e.g. a boundary lies between t and h in "pothole" but not in "there"). Moreover, morphological boundaries can reduce the total number of, and increase the counts of, seen unit subsequences. Accordingly, we test here the effect of augmenting input sequences of letters with morphological boundaries. We find that morphological boundaries substantially lower the Word and Phone Error Rates (WER and PER) of a Bi-LSTM performing grapheme-to-phoneme (G2P) conversion, and also increase the naturalness scores of Tacotrons performing TTS in a MUSHRA listening test. The improvements to TTS quality are such that grapheme input augmented with morphological boundaries outperforms phone input without boundaries. Since morphological segmentation can be predicted with high accuracy, we highlight that this simple pre-processing step has important potential for S2S modelling in TTS.
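The pre-processing step the abstract describes can be pictured as inserting a boundary token between morphemes before the letter sequence is fed to the S2S model. The sketch below is illustrative only: the boundary symbol "|" and the toy morpheme lexicon are assumptions, not the paper's actual segmenter or symbol inventory.

```python
# Sketch of boundary-augmented grapheme input for a G2P/TTS model.
# The lexicon and boundary symbol are hypothetical placeholders; the paper
# predicts morphological segmentations rather than looking them up.

MORPH_LEXICON = {
    "pothole": ["pot", "hole"],  # boundary falls between t and h
    "there": ["there"],          # no internal boundary: th stays together
}

BOUNDARY = "|"


def augment_with_boundaries(word: str) -> list[str]:
    """Return the word's characters with a boundary token between morphemes."""
    morphs = MORPH_LEXICON.get(word, [word])  # fall back to the whole word
    tokens: list[str] = []
    for i, morph in enumerate(morphs):
        if i > 0:
            tokens.append(BOUNDARY)
        tokens.extend(morph)
    return tokens


print(augment_with_boundaries("pothole"))  # ['p','o','t','|','h','o','l','e']
print(augment_with_boundaries("there"))    # ['t','h','e','r','e']
```

With this augmentation, the ambiguous substring "th" in "pothole" is split by the boundary token, so the model sees different subsequences for the two pronunciations.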
Pages: 1738 - 1742
Page count: 5
Related Papers
50 in total
  • [1] Detection and analysis of attention errors in sequence-to-sequence text-to-speech
    Valentini-Botinhao, Cassia
    King, Simon
    [J]. INTERSPEECH 2021, 2021, : 2746 - 2750
  • [2] Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis
    Yasuda, Yusuke
    Wang, Xin
    Yamagishi, Junichi
    [J]. COMPUTER SPEECH AND LANGUAGE, 2021, 67
  • [3] Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages
    Zhang, Haitong
    Lin, Yue
    [J]. INTERSPEECH 2020, 2020, : 3161 - 3165
  • [4] A UNIFIED SEQUENCE-TO-SEQUENCE FRONT-END MODEL FOR MANDARIN TEXT-TO-SPEECH SYNTHESIS
    Pan, Junjie
    Yin, Xiang
    Zhang, Zhiling
    Liu, Shichao
    Zhang, Yang
    Ma, Zejun
    Wang, Yuxuan
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6689 - 6693
  • [5] Investigating the robustness of sequence-to-sequence text-to-speech models to imperfectly-transcribed training data
    Fong, Jason
    Gallegos, Pilar Oplustil
    Hodari, Zack
    King, Simon
    [J]. INTERSPEECH 2019, 2019, : 1546 - 1550
  • [6] Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining
    Huang, Wen-Chin
    Hayashi, Tomoki
    Wu, Yi-Chiao
    Kameoka, Hirokazu
    Toda, Tomoki
    [J]. INTERSPEECH 2020, 2020, : 4676 - 4680
  • [7] Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training
    Zhou, Kun
    Sisman, Berrak
    Li, Haizhou
    [J]. INTERSPEECH 2021, 2021, : 811 - 815
  • [8] Real-time neural text-to-speech with sequence-to-sequence acoustic model and WaveGlow or single Gaussian WaveRNN vocoders
    Okamoto, Takuma
    Toda, Tomoki
    Shiga, Yoshinori
    Kawai, Hisashi
    [J]. INTERSPEECH 2019, 2019, : 1308 - 1312
  • [9] LEVERAGING SEQUENCE-TO-SEQUENCE SPEECH SYNTHESIS FOR ENHANCING ACOUSTIC-TO-WORD SPEECH RECOGNITION
    Mimura, Masato
    Ueno, Sei
    Inaguma, Hirofumi
    Sakai, Shinsuke
    Kawahara, Tatsuya
    [J]. 2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 477 - 484
  • [10] Semi-supervised Sequence-to-sequence ASR using Unpaired Speech and Text
    Baskar, Murali Karthick
    Watanabe, Shinji
    Astudillo, Ramon
    Hori, Takaaki
    Burget, Lukas
    Cernocky, Jan
    [J]. INTERSPEECH 2019, 2019, : 3790 - 3794