Deep Gaussian process based multi-speaker speech synthesis with latent speaker representation

Cited by: 4
Authors
Mitsui, Kentaro [1]
Koriyama, Tomoki [1]
Saruwatari, Hiroshi [1]
Affiliations
[1] Univ Tokyo, Grad Sch Informat Sci & Technol, Bunkyo Ku, 7-3-1 Hongo, Tokyo 1138656, Japan
Keywords
Text-to-speech synthesis; Multi-speaker modeling; Speaker representation; Gaussian process; Deep generative models
DOI
10.1016/j.specom.2021.07.001
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
This paper proposes deep Gaussian process (DGP)-based frameworks for multi-speaker speech synthesis and speaker representation learning. A DGP is a deep architecture of Bayesian kernel regression, and it has been reported that DGP-based single-speaker speech synthesis outperforms deep neural network (DNN)-based synthesis in the framework of statistical parametric speech synthesis. By extending this method to multiple speakers, it is expected that higher speech quality can be achieved with a smaller number of training utterances from each speaker. To apply DGPs to multi-speaker speech synthesis, we propose two methods: one using a DGP with one-hot speaker codes, and the other using a deep Gaussian process latent variable model (DGPLVM). The DGP with one-hot speaker codes uses additional GP layers to transform speaker codes into latent speaker representations. The DGPLVM directly models the distribution of latent speaker representations and learns it jointly with the acoustic model parameters. In this method, acoustic speaker similarity is expressed in terms of the similarity of the speaker representations, and thus the voices of similar speakers are efficiently modeled. We experimentally evaluated the performance of the proposed methods in comparison with those of conventional DNN- and variational autoencoder (VAE)-based frameworks, in terms of acoustic feature distortion and subjective speech quality. The experimental results demonstrate that (1) the proposed DGP-based and DGPLVM-based methods improve subjective speech quality compared with a feed-forward DNN-based method, and (2) even when the amount of training data for target speakers is limited, the DGPLVM-based method outperforms the other methods, including the VAE-based one. Additionally, (3) by using a speaker representation randomly sampled from the learned speaker space, the DGPLVM-based method can generate voices of non-existent speakers.
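The following NumPy sketch is a minimal, hypothetical illustration of the core idea described in the abstract: each speaker is assigned a latent vector that is concatenated with the linguistic features before Gaussian process regression, and a voice for a "non-existent" speaker can be obtained by feeding a randomly sampled latent vector. The single GP layer, random (rather than jointly learned) speaker latents, and all dimensions and names are assumptions for illustration only, not the authors' implementation.

```python
# Hypothetical sketch of latent-speaker-conditioned GP regression (NumPy only).
import numpy as np

rng = np.random.default_rng(0)
num_speakers, latent_dim, ling_dim, frames = 4, 2, 8, 50

# Latent speaker representations; in the DGPLVM these would be learned jointly
# with the acoustic model, here they are just randomly initialised.
speaker_latents = rng.normal(size=(num_speakers, latent_dim))

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of a and b."""
    sqdist = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

# Toy training data: [linguistic features, speaker latent] -> acoustic feature.
ling = rng.normal(size=(frames, ling_dim))
speaker_ids = rng.integers(0, num_speakers, size=frames)
X = np.hstack([ling, speaker_latents[speaker_ids]])
y = np.sin(X.sum(axis=1, keepdims=True)) + 0.1 * rng.normal(size=(frames, 1))

# Exact GP regression posterior mean (one kernel-regression layer of a DGP).
noise = 1e-2
K = rbf_kernel(X, X) + noise * np.eye(frames)
alpha = np.linalg.solve(K, y)

def predict(ling_test, z_speaker):
    """Predict acoustic features for linguistic features and a speaker latent z."""
    X_test = np.hstack([ling_test, np.tile(z_speaker, (len(ling_test), 1))])
    return rbf_kernel(X_test, X) @ alpha

# Voice of an existing speaker ...
mean_existing = predict(rng.normal(size=(5, ling_dim)), speaker_latents[0])
# ... and of a non-existent speaker sampled from the latent speaker space.
mean_sampled = predict(rng.normal(size=(5, ling_dim)), rng.normal(size=latent_dim))
print(mean_existing.shape, mean_sampled.shape)
```

Because speaker identity enters only through the latent vector appended to the GP inputs, speakers with nearby latents share statistical strength through the kernel, which is the intuition behind the claim that similar speakers are modeled efficiently.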
Pages: 132-145
Number of pages: 14
Related papers
50 records in total
  • [1] Multi-speaker Text-to-speech Synthesis Using Deep Gaussian Processes
    Mitsui, Kentaro; Koriyama, Tomoki; Saruwatari, Hiroshi
    INTERSPEECH 2020, 2020: 2032-2036
  • [2] DNN based multi-speaker speech synthesis with temporal auxiliary speaker ID embedding
    Lee, Junmo; Song, Kwangsub; Noh, Kyoungjin; Park, Tae-Jun; Chang, Joon-Hyuk
    2019 INTERNATIONAL CONFERENCE ON ELECTRONICS, INFORMATION, AND COMMUNICATION (ICEIC), 2019: 61-64
  • [3] An objective evaluation of the effects of recording conditions and speaker characteristics in multi-speaker deep neural speech synthesis
    Lorincz, Beata; Stan, Adriana; Giurgiu, Mircea
    KNOWLEDGE-BASED AND INTELLIGENT INFORMATION & ENGINEERING SYSTEMS (KES 2021), 2021, 192: 756-765
  • [4] Speaker Clustering with Penalty Distance for Speaker Verification with Multi-Speaker Speech
    Das, Rohan Kumar; Yang, Jichen; Li, Haizhou
    2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019: 1630-1635
  • [5] Phoneme Duration Modeling Using Speech Rhythm-Based Speaker Embeddings for Multi-Speaker Speech Synthesis
    Fujita, Kenichi; Ando, Atsushi; Ijima, Yusuke
    INTERSPEECH 2021, 2021: 3141-3145
  • [6] Phoneme Dependent Speaker Embedding and Model Factorization for Multi-Speaker Speech Synthesis and Adaptation
    Fu, Ruibo; Tao, Jianhua; Wen, Zhengqi; Zheng, Yibin
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019: 6930-6934
  • [7] Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis
    Fujita, Kenichi; Ando, Atsushi; Ijima, Yusuke
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2024, E107D (01): 93-104
  • [8] Perceptual-Similarity-Aware Deep Speaker Representation Learning for Multi-Speaker Generative Modeling
    Saito, Yuki; Takamichi, Shinnosuke; Saruwatari, Hiroshi
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29: 1033-1048
  • [9] Autoregressive multi-speaker model in Chinese speech synthesis based on variational autoencoder
    Hao, Xiaoyang; Zhang, Pengyuan
    Shengxue Xuebao/Acta Acustica, 2022, 47 (03): 405-416
  • [10] Multi-Speaker Emotional Acoustic Modeling for CNN-Based Speech Synthesis
    Choi, Heejin; Park, Sangjun; Park, Jinuk; Hahn, Minsoo
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019: 6950-6954