ON THE INTERPLAY BETWEEN SPARSITY, NATURALNESS, INTELLIGIBILITY, AND PROSODY IN SPEECH SYNTHESIS

Cited by: 0

Authors:

Lai, Cheng-I Jeff [1,2]

Cooper, Erica [3]

Zhang, Yang [2]

Chang, Shiyu [2]

Qian, Kaizhi [2]

Liao, Yi-Lun [1]

Chuang, Yung-Sung [1]

Liu, Alexander H. [1]

Yamagishi, Junichi [3]

Cox, David [2]

Glass, James [1]

Affiliations:

[1] MIT, CSAIL, Cambridge, MA 02139 USA

[2] MIT, IBM Watson AI Lab, Cambridge, MA 02139 USA

[3] National Institute of Informatics, Tokyo, Japan

Source:

2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022

Keywords:

text-to-speech; vocoder; speech synthesis; pruning; efficiency

DOI:

10.1109/ICASSP43922.2022.9747728

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Subject classification codes:

070206; 082403

Abstract:

Are end-to-end text-to-speech (TTS) models over-parametrized? To what extent can these models be pruned, and what happens to their synthesis capabilities? This work serves as a starting point to explore pruning both spectrogram prediction networks and vocoders. We thoroughly investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech. Additionally, we explore several aspects of TTS pruning: amount of finetuning data versus sparsity, TTS-Augmentation to utilize unspoken text, and combining knowledge distillation and pruning. Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility, with similar prosody. All of our experiments are conducted on publicly available models, and findings in this work are backed by large-scale subjective tests and objective measures. Code and 200 pruned models are made available to facilitate future research on efficiency in TTS.
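
To make the notion of sparsity in the abstract concrete, the following is a minimal sketch of unstructured magnitude pruning applied to a single weight array. The abstract does not say which pruning method the authors use, so magnitude pruning is an assumption chosen only for illustration, and the names `magnitude_prune` and `measured_sparsity` are hypothetical helpers, not part of the released code.

```python
import numpy as np


def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a copy of `weights` with the smallest-magnitude entries set to zero.

    `sparsity` is the target fraction of entries to zero out, in [0, 1].
    Assumption: unstructured magnitude pruning, used here only as an example.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    pruned = weights.copy()
    k = int(round(sparsity * weights.size))  # number of entries to remove
    if k == 0:
        return pruned
    # Flat indices of the k entries with the smallest absolute value.
    idx = np.argpartition(np.abs(weights).ravel(), k - 1)[:k]
    pruned.flat[idx] = 0.0
    return pruned


def measured_sparsity(weights: np.ndarray) -> float:
    """Fraction of entries that are exactly zero."""
    return float(np.mean(weights == 0))


# Toy weight matrix standing in for one layer of a TTS model.
toy_weights = np.arange(1, 11, dtype=float).reshape(2, 5)
toy_pruned = magnitude_prune(toy_weights, 0.3)
```

In this toy case the three smallest entries are zeroed, so `measured_sparsity(toy_pruned)` is 0.3. A real study would apply such a mask per layer or globally across a model and then finetune, which this sketch does not attempt.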


Pages: 8447-8451

Page count: 5
