Investigation of Japanese PnG BERT Language Model in Text-to-Speech Synthesis for Pitch Accent Language

Cited by: 3
Authors
Yasuda, Yusuke [1]
Toda, Tomoki [1]
Affiliations
[1] Nagoya Univ, Informat Technol Ctr, Nagoya, Aichi 4648601, Japan
Keywords
Bit error rate; Task analysis; Rendering (computer graphics); Feature extraction; Transformers; Syntactics; Predictive models; PnG BERT; text-to-speech; Japanese; pitch accent; self-supervised learning; TACOTRON
DOI
10.1109/JSTSP.2022.3190672
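Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification codes
0808; 0809
Abstract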
End-to-end text-to-speech synthesis (TTS) can generate highly natural synthetic speech from raw text. However, rendering correct pitch accents remains a challenging problem for end-to-end TTS. To tackle this challenge in Japanese end-to-end TTS, we adopt PnG BERT, a self-supervised pretrained model in the character and phoneme domain for TTS. We investigate the effects of the features captured by PnG BERT on Japanese TTS by modifying the fine-tuning conditions to determine which conditions are helpful for inferring pitch accents. We manipulate the content of PnG BERT features, from text-oriented to speech-oriented, by changing the number of fine-tuned layers during TTS. In addition, we teach PnG BERT pitch accent information by fine-tuning with tone prediction as an additional downstream task. Our experimental results show that the features of PnG BERT captured by pretraining contain information helpful for inferring pitch accents, and that PnG BERT outperforms the baseline Tacotron on accent correctness in a listening test.
Pages: 1319-1328
Number of pages: 10