Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation

Cited by: 0

Authors:

Tu, Tao [1]

Chen, Yuan-Jui [1]

Liu, Alexander H. [1]

Lee, Hung-yi [1]

Affiliation:

[1] National Taiwan University, College of Electrical Engineering and Computer Science, Taipei, Taiwan

Source:

INTERSPEECH 2020 | 2020

Keywords:

multi-speaker speech synthesis; semi-supervised learning; discrete speech representation

DOI:

10.21437/Interspeech.2020-1824

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification codes:

100104; 100213

Abstract:

Recently, end-to-end multi-speaker text-to-speech (TTS) systems have achieved success when large amounts of high-quality speech and corresponding transcriptions are available. However, the laborious process of collecting paired data prevents many institutes from building high-performance multi-speaker TTS systems. In this work, we propose a semi-supervised learning approach for multi-speaker TTS. A multi-speaker TTS model can learn from untranscribed audio via the proposed encoder-decoder framework with discrete speech representation. The experimental results demonstrate that with only an hour of paired speech data, whether the paired data comes from multiple speakers or a single speaker, the proposed model can generate intelligible speech in different voices. We found that the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy. In addition, our analysis reveals that the speaker characteristics of the paired data affect the effectiveness of semi-supervised TTS.


Pages: 3191-3195

Number of pages: 5
