Phoneme Duration Modeling Using Speech Rhythm-Based Speaker Embeddings for Multi-Speaker Speech Synthesis

Cited by: 3

Authors:

Fujita, Kenichi [1]

Ando, Atsushi [1]

Ijima, Yusuke [1]

Affiliations:

[1] NTT Corp, Tokyo, Japan

Source:

INTERSPEECH 2021 | 2021

Keywords:

speaker embedding; phoneme duration; speech synthesis; speaking rhythm

DOI:

10.21437/Interspeech.2021-826

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification code:

100104; 100213

Abstract:

This paper proposes a novel speech-rhythm-based method for extracting speaker embeddings. Conventionally, spectral feature-based speaker embedding vectors such as the x-vector are used as auxiliary information for multi-speaker speech synthesis. However, speech synthesis with conventional embeddings has difficulty reproducing the target speaker's speech rhythm, one of the important factors among speaker characteristics, because spectral features do not explicitly include speech rhythm. In this paper, speaker embeddings that take speech rhythm information into account are introduced to achieve phoneme duration modeling using a few utterances by the target speaker. A novel point of the proposed method is that the rhythm-based embeddings are extracted from phonemes and their durations, using a speaker identification model similar to the conventional spectral feature-based one. We conducted two experiments: speaker embedding generation and speech synthesis with the generated embeddings. We show that the proposed model achieves an EER of 10.3% in speaker identification even with only speech rhythm. Visualizing the embeddings shows that utterances with similar rhythms also have similar speaker embeddings. The results of objective and subjective evaluations of speech synthesis demonstrate that the proposed method can synthesize speech with a speech rhythm closer to that of the target speaker.
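The abstract reports an equal error rate (EER) for the rhythm-only model. As a reading aid, here is a minimal Python sketch of how an EER is commonly computed from trial scores. It does not reproduce the authors' evaluation protocol, which this record does not describe. The function name `equal_error_rate`, the threshold sweep, and the example scores are all assumptions made for illustration.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER from same-speaker and different-speaker trial scores.

    Hypothetical helper: higher scores are assumed to mean "more likely the
    same speaker". The EER is the operating point where the false acceptance
    rate and the false rejection rate are (approximately) equal.
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)

    # Sweep every observed score as a candidate decision threshold.
    thresholds = np.sort(np.concatenate([target, nontarget]))

    # False rejection rate: same-speaker trials scored below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance rate: different-speaker trials at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])

    # Take the threshold where the two error rates are closest and average them.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


# Made-up scores for illustration only; these are not data from the paper.
eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(f"EER = {eer:.2%}")
```

A lower EER means the embeddings separate speakers better. The reported 10.3% would therefore indicate that rhythm alone carries a substantial amount of speaker information.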

Pages: 3141-3145

Page count: 5

50 items in total

[1] Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis
Fujita, Kenichi
Ando, Atsushi
Ijima, Yusuke
[J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2024, E107D (01) : 93 - 104
[2] PHONEME DEPENDENT SPEAKER EMBEDDING AND MODEL FACTORIZATION FOR MULTI-SPEAKER SPEECH SYNTHESIS AND ADAPTATION
Fu, Ruibo
Tao, Jianhua
Wen, Zhengqi
Zheng, Yibin
[J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6930 - 6934
[3] Unsupervised Discovery of Phoneme Boundaries in Multi-Speaker Continuous Speech
Armstrong, Tom
Antetomaso, Stephanie
[J]. 2011 IEEE INTERNATIONAL CONFERENCE ON DEVELOPMENT AND LEARNING (ICDL), 2011,
[4] End-to-End Multi-Speaker Speech Recognition using Speaker Embeddings and Transfer Learning
Denisov, Pavel
Ngoc Thang Vu
[J]. INTERSPEECH 2019, 2019, : 4425 - 4429
[5] MULTI-SPEAKER EMOTIONAL ACOUSTIC MODELING FOR CNN-BASED SPEECH SYNTHESIS
Choi, Heejin
Park, Sangjun
Park, Jinuk
Hahn, Minsoo
[J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6950 - 6954
[6] Deep Gaussian process based multi-speaker speech synthesis with latent speaker representation
Mitsui, Kentaro
Koriyama, Tomoki
Saruwatari, Hiroshi
[J]. SPEECH COMMUNICATION, 2021, 132 : 132 - 145
[7] DNN based multi-speaker speech synthesis with temporal auxiliary speaker ID embedding
Lee, Junmo
Song, Kwangsub
Noh, Kyoungjin
Park, Tae-Jun
Chang, Joon-Hyuk
[J]. 2019 INTERNATIONAL CONFERENCE ON ELECTRONICS, INFORMATION, AND COMMUNICATION (ICEIC), 2019, : 61 - 64
[8] Speaker Clustering with Penalty Distance for Speaker Verification with Multi-Speaker Speech
Das, Rohan Kumar
Yang, Jichen
Li, Haizhou
[J]. 2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 1630 - 1635
[9] MULTI-SPEAKER EMOTIONAL SPEECH SYNTHESIS WITH FINE-GRAINED PROSODY MODELING
Lu, Chunhui
Wen, Xue
Liu, Ruolan
Chen, Xiao
[J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5729 - 5733
[10] Cross-lingual, Multi-speaker Text-To-Speech Synthesis Using Neural Speaker Embedding
Chen, Mengnan
Chen, Minchuan
Liang, Shuang
Ma, Jun
Chen, Lei
Wang, Shaojun
Xiao, Jing
[J]. INTERSPEECH 2019, 2019, : 2105 - 2109