LOW-LATENCY INCREMENTAL TEXT-TO-SPEECH SYNTHESIS WITH DISTILLED CONTEXT PREDICTION NETWORK

Authors
Saeki, Takaaki [1]
Takamichi, Shinnosuke [1]
Saruwatari, Hiroshi [1]
Affiliations
[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan
Keywords
Incremental text-to-speech synthesis; end-to-end text-to-speech synthesis; knowledge distillation; context estimation; language model
DOI
10.1109/ASRU51503.2021.9687904
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Incremental text-to-speech (TTS) synthesis generates utterances in small linguistic units to support real-time, low-latency applications. We previously proposed an incremental TTS method that leverages a large pre-trained language model to take unobserved future context into account without waiting for the subsequent segment. Although this method achieves speech quality comparable to that of a method that waits for the future context, it requires a large amount of computation to sample from the language model at each time step. In this paper, we propose an incremental TTS method that directly predicts the unobserved future context with a lightweight model, instead of sampling words from the large-scale language model. We perform knowledge distillation from a GPT2-based context prediction network into a simple recurrent model by minimizing a teacher-student loss defined between the context embedding vectors of those models. Experimental results show that the proposed method requires roughly one-tenth of the inference time of our previous method while achieving comparable synthetic speech quality, and that it can perform incremental synthesis much faster than the average speaking speed of human English speakers, demonstrating the applicability of our method to real-time applications.
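The distillation step in the abstract (a student trained to match the teacher's context embedding vectors) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the teacher is a frozen random linear map standing in for the GPT2-based network, the student is a plain linear map rather than a recurrent model, and the teacher-student loss is taken to be mean squared error, which the abstract does not specify.

```python
import random

random.seed(0)
IN_DIM, EMB_DIM = 4, 3  # hypothetical feature and embedding sizes


def linear(W, x):
    """Apply a weight matrix W (list of rows) to an input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def distill_step(W, x, teacher_emb, lr):
    """One in-place gradient-descent step on the MSE embedding-matching loss."""
    pred = linear(W, x)
    n = len(pred)
    for i, row in enumerate(W):
        grad_out = 2.0 * (pred[i] - teacher_emb[i]) / n  # dLoss / dpred[i]
        for j in range(len(row)):
            row[j] -= lr * grad_out * x[j]


# Frozen stand-in teacher (assumption: replaces the large context predictor).
teacher_W = [[random.uniform(-1, 1) for _ in range(IN_DIM)] for _ in range(EMB_DIM)]

# Toy inputs and the teacher's context embeddings for them.
inputs = [[random.uniform(-1, 1) for _ in range(IN_DIM)] for _ in range(20)]
targets = [linear(teacher_W, x) for x in inputs]

# Lightweight student, initialised to zero.
student_W = [[0.0] * IN_DIM for _ in range(EMB_DIM)]


def dataset_loss(W):
    """Average teacher-student loss over the toy dataset."""
    return sum(mse(linear(W, x), t) for x, t in zip(inputs, targets)) / len(inputs)


initial_loss = dataset_loss(student_W)
for _ in range(200):
    for x, t in zip(inputs, targets):
        distill_step(student_W, x, t, lr=0.5)
final_loss = dataset_loss(student_W)

print(f"teacher-student loss: {initial_loss:.4f} -> {final_loss:.6f}")
```

At inference time only the student would be evaluated, which is where a saving in inference time would come from. The sizes and learning rate above are arbitrary toy values.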
Pages: 749-756
Page count: 8