LOW-LATENCY INCREMENTAL TEXT-TO-SPEECH SYNTHESIS WITH DISTILLED CONTEXT PREDICTION NETWORK

Citations: 0

Authors:

Saeki, Takaaki [1]

Takamichi, Shinnosuke [1]

Saruwatari, Hiroshi [1]

Affiliation:

[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan

Source:

2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU) | 2021

Keywords:

Incremental text-to-speech synthesis; end-to-end text-to-speech synthesis; knowledge distillation; context estimation; language model;

DOI:

10.1109/ASRU51503.2021.9687904

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Incremental text-to-speech (TTS) synthesis generates utterances in small linguistic units for real-time, low-latency applications. We previously proposed an incremental TTS method that leverages a large pre-trained language model to take unobserved future context into account without waiting for the subsequent segment. Although this method achieves speech quality comparable to that of a method that waits for the future context, it requires a large amount of processing to sample from the language model at each time step. In this paper, we propose an incremental TTS method that directly predicts the unobserved future context with a lightweight model instead of sampling words from the large-scale language model. We perform knowledge distillation from a GPT2-based context prediction network into a simple recurrent model by minimizing a teacher-student loss defined between the context embedding vectors of the two models. Experimental results show that the proposed method needs about one-tenth of the inference time of our previous method to achieve comparable synthetic speech quality, and that it performs incremental synthesis much faster than the average speaking rate of English speakers, demonstrating its applicability to real-time applications.
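
The teacher-student loss mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact form of the loss, so the mean squared error used here is an assumption, and the names `distillation_loss`, `loss_gradient`, and `distill` are hypothetical. For brevity, the toy loop updates the student embedding directly; in an actual system the gradient would instead update the parameters of the student recurrent model, with the teacher embedding produced by the GPT2-based network.

```python
from typing import List

Vector = List[float]


def distillation_loss(student: Vector, teacher: Vector) -> float:
    """Mean squared error between student and teacher context embeddings.

    MSE is an assumed choice; the abstract only says the loss is defined
    between the context embedding vectors of the two models.
    """
    if len(student) != len(teacher):
        raise ValueError("embeddings must have the same dimension")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(teacher)


def loss_gradient(student: Vector, teacher: Vector) -> Vector:
    """Gradient of the MSE loss with respect to the student embedding."""
    n = len(teacher)
    return [2.0 * (s - t) / n for s, t in zip(student, teacher)]


def distill(student: Vector, teacher: Vector,
            lr: float = 0.5, steps: int = 20) -> Vector:
    """Toy gradient descent that pulls the student embedding toward the teacher.

    The teacher embedding stays fixed, mirroring a frozen teacher model.
    """
    current = list(student)
    for _ in range(steps):
        grad = loss_gradient(current, teacher)
        current = [c - lr * g for c, g in zip(current, grad)]
    return current


if __name__ == "__main__":
    teacher_embedding = [0.2, -1.0, 0.7]   # stand-in for the teacher output
    student_embedding = [0.0, 0.0, 0.0]    # stand-in for the student output
    before = distillation_loss(student_embedding, teacher_embedding)
    after = distillation_loss(
        distill(student_embedding, teacher_embedding), teacher_embedding
    )
    print(f"loss before: {before:.4f}, loss after: {after:.6f}")
```

The point of the sketch is only that the student is trained to match the teacher in embedding space, so no words need to be sampled from the large language model at inference time.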


Pages: 749-756

Page count: 8

