Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis

Cited: 0
Authors
Fujita, Kenichi [1]
Ando, Atsushi [1]
Ijima, Yusuke [1]
Affiliations
[1] NTT Corp, NTT Human Informat Labs, Yokosuka 2390847, Japan
Keywords
speaker embedding; phoneme duration; speech synthesis; speech rhythm;
DOI
10.1587/transinf.2023EDP7039
Chinese Library Classification: TP [Automation Technology, Computer Technology]
Discipline code: 0812
Abstract
This paper proposes a speech rhythm-based method for speaker embeddings to model phoneme duration using a few utterances by the target speaker. Speech rhythm is one of the essential factors among speaker characteristics, along with acoustic features such as F0, for reproducing individual utterances in speech synthesis. A novel feature of the proposed method is the rhythm-based embeddings extracted from phonemes and their durations, which are known to be related to speaking rhythm. They are extracted with a speaker identification model similar to the conventional spectral feature-based one. We conducted three experiments to evaluate the performance: speaker embedding generation, speech synthesis with generated embeddings, and embedding space analysis. The proposed method demonstrated moderate speaker identification performance (15.2% EER), even with only phonemes and their duration information. The objective and subjective evaluation results demonstrated that the proposed method can synthesize speech with a speech rhythm closer to the target speaker's than the conventional method. We also visualized the embeddings to evaluate the relationship between the distance of the embeddings and the perceptual similarity. The visualization of the embedding space and the analysis of embedding closeness indicated that the distribution of embeddings reflects the subjective and objective similarity.
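The core idea of the abstract (mapping a sequence of phonemes and their durations to a fixed-length speaker embedding) can be illustrated with a minimal sketch. Note this is a toy illustration, not the paper's method: the paper trains a speaker identification model to produce embeddings, whereas the `phoneme_vector`, `rhythm_embedding`, and `cosine` helpers below, the phoneme inventory, and the duration-weighted mean pooling are all illustrative assumptions.

```python
import math

# Hypothetical toy phoneme inventory (the paper uses a full phoneme set).
PHONEMES = ["a", "i", "u", "e", "o", "k", "s", "t", "n", "h", "m", "r", "w", "N", "pau"]
DIM = 8  # illustrative embedding dimensionality

def phoneme_vector(ph: str) -> list[float]:
    """Deterministic unit vector per phoneme (a stand-in for learned weights)."""
    seed = PHONEMES.index(ph) + 1
    vec = [math.sin(seed * (j + 1) * 0.7) for j in range(DIM)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def rhythm_embedding(seq: list[tuple[str, float]]) -> list[float]:
    """Duration-weighted mean pool over phoneme vectors, L2-normalized.

    `seq` is a list of (phoneme, duration_in_seconds) pairs, so the same
    phoneme string spoken with different timing yields a different embedding.
    """
    acc = [0.0] * DIM
    total = 0.0
    for ph, dur in seq:
        vec = phoneme_vector(ph)
        for j in range(DIM):
            acc[j] += dur * vec[j]
        total += dur
    pooled = [v / total for v in acc]
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two L2-normalized embeddings."""
    return sum(x * y for x, y in zip(a, b))

# The same phoneme sequence with different rhythms gives distinct embeddings.
fast = rhythm_embedding([("k", 0.05), ("o", 0.08), ("N", 0.06)])
slow = rhythm_embedding([("k", 0.12), ("o", 0.30), ("N", 0.10)])
print(cosine(fast, slow))
```

In the actual paper, the embedding network is trained with a speaker identification objective, so embeddings of the same speaker cluster together; the fixed pooling above merely shows why duration information alone can separate utterances with different timing.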
Pages: 93-104
Page count: 12
Related Papers
50 items in total
  • [41] Lip2Speech: Lightweight Multi-Speaker Speech Reconstruction with Gabor Features
    Dong, Zhongping
    Xu, Yan
    Abel, Andrew
    Wang, Dong
    [J]. APPLIED SCIENCES-BASEL, 2024, 14 (02):
  • [42] The Effects of Phoneme Errors in Speaker Adaptation for HMM Speech Synthesis
    Toth, Balint
    Fegyo, Tibor
    Nemeth, Geza
    [J]. 12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011, : 2816 - +
  • [43] Gender-Dependent Babble Maskers Created from Multi-Speaker Speech for Speech Privacy Protection
    Kondo, Kazuhiro
    Sakurai, Hiroki
    [J]. 2014 TENTH INTERNATIONAL CONFERENCE ON INTELLIGENT INFORMATION HIDING AND MULTIMEDIA SIGNAL PROCESSING (IIH-MSP 2014), 2014, : 251 - 254
  • [44] Silent versus modal multi-speaker speech recognition from ultrasound and video
    Ribeiro, Manuel Sam
    Eshky, Aciel
    Richmond, Korin
    Renals, Steve
    [J]. INTERSPEECH 2021, 2021, : 641 - 645
  • [45] MULTI-SPEAKER SEQUENCE-TO-SEQUENCE SPEECH SYNTHESIS FOR DATA AUGMENTATION IN ACOUSTIC-TO-WORD SPEECH RECOGNITION
    Ueno, Sei
    Mimura, Masato
    Sakai, Shinsuke
    Kawahara, Tatsuya
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6161 - 6165
  • [46] Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation
    Tu, Tao
    Chen, Yuan-Jui
    Liu, Alexander H.
    Lee, Hung-yi
    [J]. INTERSPEECH 2020, 2020, : 3191 - 3195
  • [47] Cross-lingual multi-speaker speech synthesis with limited bilingual training data
    Cai, Zexin
    Yang, Yaogen
    Li, Ming
    [J]. COMPUTER SPEECH AND LANGUAGE, 2023, 77
  • [48] Multi-Speaker Modeling with Shared Prior Distributions and Model Structures for Bayesian Speech Synthesis
    Hashimoto, Kei
    Nankaku, Yoshihiko
    Tokuda, Keiichi
    [J]. 12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011, : 120 - 123
  • [49] MIMO-SPEECH: END-TO-END MULTI-CHANNEL MULTI-SPEAKER SPEECH RECOGNITION
    Chang, Xuankai
    Zhang, Wangyou
    Qian, Yanmin
    Le Roux, Jonathan
    Watanabe, Shinji
    [J]. 2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 237 - 244
  • [50] INVESTIGATING ON INCORPORATING PRETRAINED AND LEARNABLE SPEAKER REPRESENTATIONS FOR MULTI-SPEAKER MULTI-STYLE TEXT-TO-SPEECH
    Chien, Chung-Ming
    Lin, Jheng-Hao
    Huang, Chien-yu
    Hsu, Po-chun
    Lee, Hung-yi
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 8588 - 8592