Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis

Cited by: 0

Authors:

Fujita, Kenichi [1]

Ando, Atsushi [1]

Ijima, Yusuke [1]

Affiliations:

[1] NTT Corp, NTT Human Informat Labs, Yokosuka 2390847, Japan

Source:

IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS | 2024 / Vol. E107D / No. 01

Keywords:

speaker embedding; phoneme duration; speech synthesis; speech rhythm

DOI:

10.1587/transinf.2023EDP7039

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

This paper proposes a speech rhythm-based method for extracting speaker embeddings that model phoneme duration from only a few utterances by the target speaker. Speech rhythm is one of the essential speaker characteristics, along with acoustic features such as F0, for reproducing individual utterances in speech synthesis. A novel feature of the proposed method is that the rhythm-based embeddings are extracted from phonemes and their durations, which are known to be related to speaking rhythm. They are extracted with a speaker identification model similar to the conventional spectral feature-based one. We conducted three experiments to evaluate performance: speaker embedding generation, speech synthesis with the generated embeddings, and embedding space analysis. The proposed method demonstrated moderate speaker identification performance (15.2% EER), even with only phonemes and their duration information. The objective and subjective evaluation results demonstrated that the proposed method can synthesize speech with a speech rhythm closer to that of the target speaker than the conventional method can. We also visualized the embeddings to evaluate the relationship between the distance between embeddings and perceptual similarity. The visualization of the embedding space and the analysis of the relationship between embedding closeness and similarity indicated that the distribution of embeddings reflects both subjective and objective similarity.
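The abstract reports speaker identification performance as an equal error rate (EER). As a rough illustration of what that figure measures, the sketch below computes an EER from two lists of similarity scores using the standard definition: the operating point where the false acceptance rate equals the false rejection rate. This is not the authors' evaluation code. The function name `equal_error_rate` and the score values are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def equal_error_rate(
    target_scores: List[float], impostor_scores: List[float]
) -> Tuple[float, float]:
    """Return (eer, threshold) for same-speaker and different-speaker scores.

    A trial is accepted when its score is >= threshold.
    """
    # Candidate thresholds: every distinct score observed in either list.
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap = float("inf")
    best_eer = 1.0
    best_threshold = thresholds[0]
    for thr in thresholds:
        # False acceptance rate: impostor trials wrongly accepted.
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        # False rejection rate: target trials wrongly rejected.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Average the two rates at the closest crossing point.
            best_gap, best_eer, best_threshold = gap, (far + frr) / 2, thr
    return best_eer, best_threshold


# Made-up scores, for illustration only (not data from the paper).
demo_target = [0.9, 0.8, 0.7, 0.4]
demo_impostor = [0.6, 0.3, 0.2, 0.1]
demo_eer, demo_threshold = equal_error_rate(demo_target, demo_impostor)
```

With these made-up scores, one target trial is rejected and one impostor trial is accepted at the selected threshold, so both error rates are 25%. A lower EER means the embeddings separate speakers better.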


Pages: 93-104

Number of pages: 12

50 items in total

[1] Phoneme Duration Modeling Using Speech Rhythm-Based Speaker Embeddings for Multi-Speaker Speech Synthesis
Fujita, Kenichi
Ando, Atsushi
Ijima, Yusuke
[J]. INTERSPEECH 2021, 2021, : 3141 - 3145
[2] PHONEME DEPENDENT SPEAKER EMBEDDING AND MODEL FACTORIZATION FOR MULTI-SPEAKER SPEECH SYNTHESIS AND ADAPTATION
Fu, Ruibo
Tao, Jianhua
Wen, Zhengqi
Zheng, Yibin
[J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6930 - 6934
[3] Unsupervised Discovery of Phoneme Boundaries in Multi-Speaker Continuous Speech
Armstrong, Tom
Antetomaso, Stephanie
[J]. 2011 IEEE INTERNATIONAL CONFERENCE ON DEVELOPMENT AND LEARNING (ICDL), 2011
[4] Deep Gaussian process based multi-speaker speech synthesis with latent speaker representation
Mitsui, Kentaro
Koriyama, Tomoki
Saruwatari, Hiroshi
[J]. SPEECH COMMUNICATION, 2021, 132 : 132 - 145
[5] DNN based multi-speaker speech synthesis with temporal auxiliary speaker ID embedding
Lee, Junmo
Song, Kwangsub
Noh, Kyoungjin
Park, Tae-Jun
Chang, Joon-Hyuk
[J]. 2019 INTERNATIONAL CONFERENCE ON ELECTRONICS, INFORMATION, AND COMMUNICATION (ICEIC), 2019, : 61 - 64
[6] End-to-End Multi-Speaker Speech Recognition using Speaker Embeddings and Transfer Learning
Denisov, Pavel
Ngoc Thang Vu
[J]. INTERSPEECH 2019, 2019, : 4425 - 4429
[7] Speaker Clustering with Penalty Distance for Speaker Verification with Multi-Speaker Speech
Das, Rohan Kumar
Yang, Jichen
Li, Haizhou
[J]. 2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 1630 - 1635
[8] An Unsupervised Method to Select a Speaker Subset from Large Multi-Speaker Speech Synthesis Datasets
Gallegos, Pilar Oplustil
Williams, Jennifer
Rownicka, Joanna
King, Simon
[J]. INTERSPEECH 2020, 2020, : 1758 - 1762
[9] MultiSpeech: Multi-Speaker Text to Speech with Transformer
Chen, Mingjian
Tan, Xu
Ren, Yi
Xu, Jin
Sun, Hao
Zhao, Sheng
Qin, Tao
[J]. INTERSPEECH 2020, 2020, : 4024 - 4028
[10] ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH WITH STATE-OF-THE-ART NEURAL SPEAKER EMBEDDINGS
Cooper, Erica
Lai, Cheng-I
Yasuda, Yusuke
Fang, Fuming
Wang, Xin
Chen, Nanxin
Yamagishi, Junichi
[J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6184 - 6188
