Autoregressive multi-speaker model in Chinese speech synthesis based on variational autoencoder

Cited by: 0

Authors:

Hao, Xiaoyang [1,2]

Zhang, Pengyuan [1,2]

Affiliations:

[1] Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China

[2] University of Chinese Academy of Sciences, Beijing 100049, China

Source:

Shengxue Xuebao/Acta Acustica | 2022, Vol. 47, No. 03

DOI: not available

Abstract:

Speaker adaptation and speaker labels are two common approaches to multi-speaker speech synthesis. A model obtained by speaker adaptation supports only the speech of the adapted speaker and is not robust enough. Conventional speaker labels require speaker information to be obtained with supervision and cannot be learned unsupervised from the speech itself. To address these problems, an autoregressive multi-speaker framework based on a variational autoencoder is proposed. First, speaker information is learned by the variational autoencoder without supervision and encoded into speaker labels. Then the speaker labels, together with linguistic features, are fed into an autoregressive acoustic model. In addition, the acoustic model adopts multi-task learning to avoid over-fitting of the fundamental frequency. Preliminary experiments show that the autoregressive network structure decreases the cepstral distortion by 1.018 dB and that multi-task learning reduces the root mean square error of the fundamental frequency by 6.861 Hz. In the subsequent comparative experiments, the proposed method achieves Mean Opinion Scores (MOS) of 3.71, 3.55, and 3.15 and Pinyin Error Rates of 6.71%, 7.54%, and 9.87% on three sub-tasks of multi-speaker speech synthesis, which shows that the proposed method noticeably improves the quality of synthesized speech. © 2022 Acta Acustica.
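
The two objective measures quoted in the abstract, cepstral distortion in dB and root mean square error of the fundamental frequency in Hz, can be illustrated with a minimal sketch. The abstract does not give the exact definitions used in the paper, so the code below assumes the commonly used mel-cepstral distortion formula (with the 0th coefficient excluded) and an F0 error computed only over frames voiced in both contours. The function names and the requirement that the inputs are already time-aligned are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def mel_cepstral_distortion(ref, syn):
    """Mean mel-cepstral distortion in dB (assumed definition).

    ref, syn: time-aligned arrays of shape (frames, dims).
    The 0th (energy) coefficient is skipped, a common convention.
    """
    ref = np.asarray(ref, dtype=float)
    syn = np.asarray(syn, dtype=float)
    diff = ref[:, 1:] - syn[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))


def f0_rmse(ref_f0, syn_f0):
    """RMSE in Hz over frames voiced (F0 > 0) in both contours (assumed convention)."""
    ref_f0 = np.asarray(ref_f0, dtype=float)
    syn_f0 = np.asarray(syn_f0, dtype=float)
    voiced = (ref_f0 > 0) & (syn_f0 > 0)
    if not np.any(voiced):
        raise ValueError("no frames are voiced in both contours")
    return float(np.sqrt(np.mean((ref_f0[voiced] - syn_f0[voiced]) ** 2)))
```

Under these assumptions, a reported drop of 1.018 dB or 6.861 Hz would correspond to the difference between these values for two systems evaluated on the same test utterances.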

Pages: 405-416

50 items in total

[1] NNSPEECH: SPEAKER-GUIDED CONDITIONAL VARIATIONAL AUTOENCODER FOR ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH
Zhao, Botao
Zhang, Xulong
Wang, Jianzong
Cheng, Ning
Xiao, Jing
[J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4293 - 4297
[2] PHONEME DEPENDENT SPEAKER EMBEDDING AND MODEL FACTORIZATION FOR MULTI-SPEAKER SPEECH SYNTHESIS AND ADAPTATION
Fu, Ruibo
Tao, Jianhua
Wen, Zhengqi
Zheng, Yibin
[J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6930 - 6934
[3] Deep Gaussian process based multi-speaker speech synthesis with latent speaker representation
Mitsui, Kentaro
Koriyama, Tomoki
Saruwatari, Hiroshi
[J]. SPEECH COMMUNICATION, 2021, 132 : 132 - 145
[4] DNN based multi-speaker speech synthesis with temporal auxiliary speaker ID embedding
Lee, Junmo
Song, Kwangsub
Noh, Kyoungjin
Park, Tae-Jun
Chang, Joon-Hyuk
[J]. 2019 INTERNATIONAL CONFERENCE ON ELECTRONICS, INFORMATION, AND COMMUNICATION (ICEIC), 2019, : 61 - 64
[5] LIGHTSPEECH: LIGHTWEIGHT NON-AUTOREGRESSIVE MULTI-SPEAKER TEXT-TO-SPEECH
Li, Song
Ouyang, Beibei
Li, Lin
Hong, Qingyang
[J]. 2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 499 - 506
[6] Phoneme Duration Modeling Using Speech Rhythm-Based Speaker Embeddings for Multi-Speaker Speech Synthesis
Fujita, Kenichi
Ando, Atsushi
Ijima, Yusuke
[J]. INTERSPEECH 2021, 2021, : 3141 - 3145
[7] MULTI-SPEAKER EMOTIONAL ACOUSTIC MODELING FOR CNN-BASED SPEECH SYNTHESIS
Choi, Heejin
Park, Sangjun
Park, Jinuk
Hahn, Minsoo
[J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6950 - 6954
[8] MULTI-SPEAKER AND MULTI-DOMAIN EMOTIONAL VOICE CONVERSION USING FACTORIZED HIERARCHICAL VARIATIONAL AUTOENCODER
Elgaar, Mohamed
Park, Jungbae
Lee, Sang Wan
[J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7769 - 7773
[9] Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis
Fujita, Kenichi
Ando, Atsushi
Ijima, Yusuke
[J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2024, E107D (01) : 93 - 104
[10] AUTOREGRESSIVE VARIATIONAL AUTOENCODER WITH A HIDDEN SEMI-MARKOV MODEL-BASED STRUCTURED ATTENTION FOR SPEECH SYNTHESIS
Fujimoto, Takato
Hashimoto, Kei
Nankaku, Yoshihiko
Tokuda, Keiichi
[J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7462 - 7466
