Enhancing Sequence-to-Sequence Text-to-Speech with Morphology

Cited by: 3

Authors:

Taylor, Jason [1]

Richmond, Korin [1]

Affiliation:

[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland

Source:

INTERSPEECH 2020 | 2020

Keywords:

Speech Synthesis; Sequence-to-Sequence; Morphology; Pronunciation;

DOI:

10.21437/Interspeech.2020-1547

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology];

Subject classification codes:

100104 ; 100213 ;

Abstract:

Neural sequence-to-sequence (S2S) modelling encodes a single, unified representation for each input sequence. When used for text-to-speech synthesis (TTS), such representations must embed ambiguities between English spelling and pronunciation. For example, in pothole and there the character sequence th sounds different. This can be problematic when predicting pronunciation directly from letters. We posit pronunciation becomes easier to predict when letters are grouped into subword units like morphemes (e.g. a boundary lies between t and h in pothole but not there). Moreover, morphological boundaries can reduce the total number of, and increase the counts of, seen unit subsequences. Accordingly, we test here the effect of augmenting input sequences of letters with morphological boundaries. We find morphological boundaries substantially lower the Word and Phone Error Rates (WER and PER) for a Bi-LSTM performing G2P on one hand, and also increase the naturalness scores of Tacotrons performing TTS in a MUSHRA listening test on the other. The improvements to TTS quality are such that grapheme input augmented with morphological boundaries outperforms phone input without boundaries. Since morphological segmentation may be predicted with high accuracy, we highlight this simple pre-processing step has important potential for S2S modelling in TTS.
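The input augmentation described in the abstract can be illustrated with a minimal sketch. It assumes a morphological segmentation is already available as a list of morphemes, and it uses a hypothetical boundary token `|`; the function name, the token, and the input format are illustrative assumptions and are not taken from the paper.

```python
from typing import List

# Hypothetical boundary token; the symbol actually used by the authors is not given here.
BOUNDARY = "|"


def augment_with_boundaries(morphemes: List[str], boundary: str = BOUNDARY) -> List[str]:
    """Flatten a morpheme segmentation into a letter sequence,
    inserting a boundary token between adjacent morphemes."""
    tokens: List[str] = []
    for i, morpheme in enumerate(morphemes):
        if i > 0:
            tokens.append(boundary)  # mark the morpheme boundary
        tokens.extend(morpheme)      # one token per letter
    return tokens


# "pothole" is segmented as pot + hole, so t and h are separated by a boundary.
pothole = augment_with_boundaries(["pot", "hole"])
# "there" is a single morpheme, so th stays contiguous.
there = augment_with_boundaries(["there"])
```

Under these assumptions, the letters `t` and `h` are adjacent in the sequence for there but separated by the boundary token in the sequence for pothole, which is the distinction the abstract suggests a sequence-to-sequence model can exploit.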


Pages: 1738-1742

Page count: 5
