Effective Data Augmentation Methods for Neural Text-to-Speech Systems

被引:0
|
作者
Oh, Suhyeon [1 ]
Kwon, Ohsung [1 ]
Hwang, Min-Jae [1 ]
Kim, Jae-Min [1 ]
Song, Eunwoo [1 ]
机构
[1] NAVER Corp, Seongnam, South Korea
关键词
speech synthesis; self-augmentation; ranking support vector machine;
D O I
10.1109/ICEIC54506.2022.9748515
中图分类号
TM [电工技术]; TN [电子技术、通信技术];
学科分类号
0808 ; 0809 ;
摘要
This paper proposes an effective self-augmentation method for improving the quality of neural text-to-speech (TTS) systems. As synthetic speech quality has been greatly improved, creating a neural TTS system using synthetic corpora is now possible. However, whether increasing the amount of synthetic data is always beneficial for improving training efficiency has not been verified. Our aim in this study is to selectively choose synthetic data whose characteristics are close to those of natural speech. Specifically, we adopt a ranking support vector machine (RankSVM) that is well known for effectively ranking relative attributes among binary classes. By setting the synthetic and recorded corpora as two opposite classes, RankSVM is used to determine how the synthesized speech is acoustically similar with the recorded data. As training data can be selectively chosen from large-scale synthetic corpora, the performance of the TTS model re-trained by those data is significantly improved. Subjective evaluation results verify that the proposed TTS model performs much better than the original model trained with recorded data alone and the similarly configured system re-trained with all the synthetic data without any selection method.
引用
收藏
页数:4
相关论文
共 50 条
  • [41] Development of Prototype Text-to-Speech Systems for Northern Sotho
    Oosthuizen, H. J.
    Phihlela, S. T.
    Manamela, M. J. D.
    INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006, : 1348 - 1351
  • [42] EVALUATING TEXT-TO-SPEECH SYSTEMS - SOME METHODOLOGICAL ASPECTS
    VANBEZOOIJEN, R
    POLS, LCW
    SPEECH COMMUNICATION, 1990, 9 (04) : 263 - 270
  • [43] Syllable duration prediction for Farsi text-to-speech systems
    Nazari, B.
    Nayebi, K.
    Sheikhzadeh, H.
    Scientia Iranica, 2004, 11 (03) : 225 - 233
  • [44] NORMALIZATION OF TEXT MESSAGES FOR TEXT-TO-SPEECH
    Pennell, Deana L.
    Liu, Yang
    2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010, : 4842 - 4845
  • [45] Constructing text-to-speech systems for languages with unknown pronunciations
    Sawada, Kei
    Hashimoto, Kei
    Oura, Keiichiro
    Nankaku, Yoshihiko
    Tokuda, Keiichi
    ACOUSTICAL SCIENCE AND TECHNOLOGY, 2018, 39 (02) : 119 - 129
  • [46] Control of intonation in HMM based text-to-speech systems
    Cai, L. (clh-dcs@tsinghua.edu.cn), 1600, Tsinghua University (53):
  • [47] Text and Speech Corpora for Text-To-Speech Synthesis of Tales
    Doukhan, David
    Rosset, Sophie
    Rilliard, Albert
    d'Alessandro, Christophe
    Adda-Decker, Martine
    LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2012, : 1003 - 1010
  • [48] Comparison of measures of speech quality for listening tests of text-to-speech systems
    Viswanathan, M
    Viswanathan, M
    PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON SPEECH SYNTHESIS, 2002, : 11 - 14
  • [49] INTELLIGIBILITY OF SPEECH PRODUCED BY TEXT-TO-SPEECH SYSTEMS IN GOOD AND TELEPHONIC CONDITIONS
    DELOGU, C
    PAOLONI, A
    RIDOLFI, P
    VAGGES, K
    ACTA ACUSTICA, 1995, 3 (01): : 89 - 96
  • [50] Automatic Speech Recognition Used for Intelligibility Assessment of Text-to-Speech Systems
    Vich, Robert
    Nouza, Jan
    Vondra, Martin
    VERBAL AND NONVERBAL FEATURES OF HUMAN-HUMAN AND HUMAN-MACHINE INTERACTIONS, 2008, 5042 : 136 - +