Investigation of Japanese PnG BERT Language Model in Text-to-Speech Synthesis for Pitch Accent Language

Times cited: 3

Authors:

Yasuda, Yusuke [1]

Toda, Tomoki [1]

Affiliation:

[1] Nagoya Univ, Informat Technol Ctr, Nagoya, Aichi 4648601, Japan

Source:

IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING | 2022 / Vol. 16 / No. 6

Keywords:

Bit error rate; Task analysis; Rendering (computer graphics); Feature extraction; Transformers; Syntactics; Predictive models; PnG BERT; text-to-speech; Japanese; pitch accent; self-supervised learning; TACOTRON;

DOI:

10.1109/JSTSP.2022.3190672

Chinese Library Classification (CLC) codes:

TM [Electrical engineering]; TN [Electronics and communication technology]

Discipline codes:

0808; 0809

Abstract:

End-to-end text-to-speech synthesis (TTS) can generate highly natural synthetic speech from raw text. However, rendering correct pitch accents is still a challenging problem for end-to-end TTS. To tackle this challenge in Japanese end-to-end TTS, we adopt PnG BERT, a self-supervised model pretrained in the character and phoneme domains for TTS. We investigate the effects of the features captured by PnG BERT on Japanese TTS by modifying the fine-tuning conditions to determine which conditions are helpful for inferring pitch accents. We shift the content of PnG BERT features from text-oriented to speech-oriented by changing the number of layers fine-tuned during TTS training. In addition, we teach PnG BERT pitch accent information by fine-tuning it with tone prediction as an additional downstream task. Our experimental results show that the features captured by PnG BERT during pretraining contain information helpful for inferring pitch accents, and that PnG BERT outperforms a baseline Tacotron on accent correctness in a listening test.
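The two fine-tuning controls described in the abstract (how many PnG BERT layers are updated during TTS training, and whether tone prediction is added as an auxiliary task) can be illustrated with the minimal sketch below. It is plain Python and is not the authors' implementation. The 12-layer encoder depth, the choice to fine-tune the top layers while freezing the lower ones, the binary high/low tone labels, the loss weighting, and the names `FineTuneConfig`, `trainable_layers`, `tone_cross_entropy`, `total_loss`, and `tone_loss_weight` are all hypothetical assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class FineTuneConfig:
    # Hypothetical encoder depth; the actual PnG BERT configuration may differ.
    num_layers: int = 12
    # Number of top encoder layers updated by the TTS loss (assumption: lower layers stay frozen).
    num_finetuned_layers: int = 2
    # Hypothetical weight of the auxiliary tone-prediction loss (0 disables the task).
    tone_loss_weight: float = 0.5


def trainable_layers(cfg: FineTuneConfig) -> list[int]:
    """Return indices of encoder layers that receive gradients (0 = lowest layer)."""
    if not 0 <= cfg.num_finetuned_layers <= cfg.num_layers:
        raise ValueError("num_finetuned_layers must be between 0 and num_layers")
    first = cfg.num_layers - cfg.num_finetuned_layers
    return list(range(first, cfg.num_layers))


def tone_cross_entropy(probs_high: list[float], targets: list[int]) -> float:
    """Mean binary cross-entropy per token, assuming labels 1 = high tone, 0 = low tone."""
    if len(probs_high) != len(targets) or not targets:
        raise ValueError("need equal-length, non-empty predictions and targets")
    eps = 1e-7
    total = 0.0
    for p, t in zip(probs_high, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)


def total_loss(tts_loss: float, probs_high: list[float], targets: list[int],
               cfg: FineTuneConfig) -> float:
    """Multi-task objective: TTS loss plus the weighted tone-prediction loss."""
    return tts_loss + cfg.tone_loss_weight * tone_cross_entropy(probs_high, targets)


if __name__ == "__main__":
    cfg = FineTuneConfig(num_layers=12, num_finetuned_layers=2)
    print(trainable_layers(cfg))  # [10, 11]
    print(total_loss(1.0, [0.9, 0.2, 0.7], [1, 0, 1], cfg))
```

Under these assumptions, raising `num_finetuned_layers` lets the TTS loss reshape more of the pretrained representation (more speech-oriented), while a value of 0 keeps the pretrained, text-oriented features fixed; setting `tone_loss_weight` to 0 falls back to fine-tuning with the TTS loss alone.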


Pages: 1319-1328

Page count: 10

Related records (50 in total):

[1] Investigation of Japanese PnG BERT language model in text-to-speech synthesis for pitch accent language
[J]. arXiv
[2] INVESTIGATION OF ENHANCED TACOTRON TEXT-TO-SPEECH SYNTHESIS SYSTEMS WITH SELF-ATTENTION FOR PITCH ACCENT LANGUAGE
Yasuda, Yusuke
Wang, Xin
Takaki, Shinji
Yamagishi, Junichi
[J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6905 - 6909
[3] Automatic Pitch Accent Prediction for Text-To-Speech Synthesis
Read, Ian
Cox, Stephen
[J]. INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, 2007, : 2085 - 2088
[4] Including Pitch Accent Optionality in Unit Selection Text-to-Speech Synthesis
Badino, Leonardo
Clark, Robert A. J.
Strom, Volker
[J]. INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, 2008, : 2118 - 2121
[5] Text-to-speech synthesis with an Indian language perspective
Panda, Soumya Priyadarsini
Nayak, Ajit Kumar
Patnaik, Srikanta
[J]. INTERNATIONAL JOURNAL OF GRID AND UTILITY COMPUTING, 2015, 6 (3-4) : 170 - 178
[6] Text-To-Speech Synthesis System for Punjabi Language
Singh, Parminder
Lehal, Gurpreet Singh
[J]. INFORMATION SYSTEMS FOR INDIAN LANGUAGES, 2011, 139 : 302 - 303
[7] Text-to-speech for Slovak language
Caky, P
Klimo, M
Mihálik, I
Mladsik, R
[J]. TEXT, SPEECH AND DIALOGUE, PROCEEDINGS, 2004, 3206 : 291 - 298
[8] Text analysis and language identification for polyglot text-to-speech synthesis
Romsdorfer, Harald
Pfister, Beat
[J]. SPEECH COMMUNICATION, 2007, 49 (09) : 697 - 724
[9] AN ACCENT-UNIT MODEL OF INTONATION FOR TEXT-TO-SPEECH SYNTHESIS
JOHNSON, M
HOUSE, J
[J]. PROCEEDINGS : INSTITUTE OF ACOUSTICS, VOL 8, PART 7: SPEECH & HEARING, 1986, 8 : 409 - 416
[10] TEXT-TO-SPEECH SYNTHESIS: A PROTOTYPE SYSTEM FOR CROATIAN LANGUAGE
Pobar, Miran
Martincic-Ipsic, Sanda
Ipsic, Ivo
[J]. ENGINEERING REVIEW, 2008, 28 (02) : 31 - 44
