Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation

Cited by: 0
Authors
Tu, Tao [1 ]
Chen, Yuan-Jui [1 ]
Liu, Alexander H. [1 ]
Lee, Hung-yi [1 ]
Affiliations
[1] Natl Taiwan Univ, Coll Elect Engn & Comp Sci, Taipei, Taiwan
Source
INTERSPEECH 2020
Keywords
multi-speaker speech synthesis; semi-supervised learning; discrete speech representation;
DOI
10.21437/Interspeech.2020-1824
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Code
100104; 100213;
Abstract
Recently, end-to-end multi-speaker text-to-speech (TTS) systems have achieved success in situations where large amounts of high-quality speech and the corresponding transcriptions are available. However, the laborious process of collecting paired data prevents many institutes from building high-performing multi-speaker TTS systems. In this work, we propose a semi-supervised learning approach for multi-speaker TTS. A multi-speaker TTS model can learn from untranscribed audio via the proposed encoder-decoder framework with discrete speech representation. The experimental results demonstrate that with only an hour of paired speech data, whether the paired data is from multiple speakers or a single speaker, the proposed model can generate intelligible speech in different voices. We found that the model benefits from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy. In addition, our analysis reveals that the speaker characteristics of the paired data have an impact on the effectiveness of semi-supervised TTS.
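The abstract only outlines the framework, so the following PyTorch snippet is a rough sketch of the general idea: a quantized (discrete) speech representation lets an encoder-decoder learn from untranscribed audio, while a small amount of paired data supervises a text-to-unit mapping. The module sizes, GRU layers, toy VQ codebook, and the frame-level alignment between text and speech units are all illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: semi-supervised training with a discrete speech representation.
# All architecture choices (GRUs, codebook size, frame-aligned text) are
# illustrative assumptions, not the paper's actual model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, num_codes=64, dim=128):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                   # z: (B, T, dim)
        dist = ((z.unsqueeze(-2) - self.codebook.weight) ** 2).sum(-1)
        idx = dist.argmin(-1)                               # discrete unit ids
        q = self.codebook(idx)                              # quantized vectors
        q_st = z + (q - z).detach()                         # straight-through
        vq_loss = F.mse_loss(q, z.detach()) + 0.25 * F.mse_loss(z, q.detach())
        return q_st, idx, vq_loss


class SpeechAutoencoder(nn.Module):
    """Audio -> discrete units -> audio, conditioned on a speaker embedding."""
    def __init__(self, n_mels=80, dim=128, num_codes=64, num_speakers=10):
        super().__init__()
        self.encoder = nn.GRU(n_mels, dim, batch_first=True)
        self.vq = VectorQuantizer(num_codes, dim)
        self.spk = nn.Embedding(num_speakers, dim)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_mels)

    def forward(self, mel, speaker):                        # mel: (B, T, n_mels)
        z, _ = self.encoder(mel)
        q, idx, vq_loss = self.vq(z)
        h, _ = self.decoder(q + self.spk(speaker).unsqueeze(1))
        return self.out(h), idx, vq_loss


class TextEncoder(nn.Module):
    """Predicts discrete speech units from text ids (paired data only).
    Assumes text tokens are pre-aligned to speech frames for simplicity."""
    def __init__(self, vocab=40, dim=128, num_codes=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, num_codes)

    def forward(self, text):                                # text: (B, T)
        h, _ = self.rnn(self.emb(text))
        return self.proj(h)                                 # (B, T, num_codes)


ae, txt_enc = SpeechAutoencoder(), TextEncoder()
opt = torch.optim.Adam(list(ae.parameters()) + list(txt_enc.parameters()), lr=1e-3)


def unpaired_step(mel, speaker):
    """Untranscribed audio: reconstruction + VQ losses only."""
    recon, _, vq_loss = ae(mel, speaker)
    return F.l1_loss(recon, mel) + vq_loss


def paired_step(text, mel, speaker):
    """Paired data additionally supervises text -> discrete-unit prediction."""
    recon, idx, vq_loss = ae(mel, speaker)
    ce = F.cross_entropy(txt_enc(text).transpose(1, 2), idx)
    return F.l1_loss(recon, mel) + vq_loss + ce


# One toy optimisation step mixing both kinds of data.
mel_u, spk_u = torch.randn(4, 50, 80), torch.randint(0, 10, (4,))
mel_p, spk_p = torch.randn(2, 50, 80), torch.randint(0, 10, (2,))
txt_p = torch.randint(0, 40, (2, 50))
loss = unpaired_step(mel_u, spk_u) + paired_step(txt_p, mel_p, spk_p)
opt.zero_grad(); loss.backward(); opt.step()
```

The sketch mirrors the core idea stated in the abstract: the decoder consumes the same kind of discrete units whether they are derived from audio (unpaired branch) or supervised from text (paired branch), so untranscribed speech still contributes training signal to the shared decoder.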
Pages: 3191-3195
Page count: 5