Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis

Cited by: 0
Authors
Fujita, Kenichi [1 ]
Ando, Atsushi [1 ]
Ijima, Yusuke [1 ]
Affiliations
[1] NTT Corp, NTT Human Informat Labs, Yokosuka 2390847, Japan
Keywords
speaker embedding; phoneme duration; speech synthesis; speech rhythm;
DOI
10.1587/transinf.2023EDP7039
CLC number
TP [automation technology, computer technology];
Subject classification code
0812
Abstract
This paper proposes a speech rhythm-based method for speaker embeddings to model phoneme duration using only a few utterances by the target speaker. Speech rhythm is one of the essential speaker characteristics, along with acoustic features such as F0, for reproducing individual utterances in speech synthesis. A novel feature of the proposed method is the rhythm-based embeddings extracted from phonemes and their durations, which are known to be related to speaking rhythm. They are extracted with a speaker identification model similar to the conventional spectral feature-based one. We conducted three experiments to evaluate the performance: speaker embedding generation, speech synthesis with the generated embeddings, and embedding space analysis. The proposed method demonstrated moderate speaker identification performance (15.2% EER), even with only phonemes and their duration information. The objective and subjective evaluation results demonstrated that the proposed method can synthesize speech whose rhythm is closer to the target speaker's than that of the conventional method. We also visualized the embeddings to evaluate the relationship between embedding distance and perceptual similarity. The visualization of the embedding space and the analysis of embedding closeness indicated that the distribution of embeddings reflects the subjective and objective similarity.
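To illustrate the core idea described in the abstract — mapping a sequence of phonemes and their durations to a fixed-length speaker vector — the following is a minimal sketch. It is not the paper's implementation: the paper trains a speaker identification network on (phoneme, duration) sequences, whereas here a fixed random lookup table stands in for the learned model, and the phoneme inventory, pooling scheme, and all names are assumptions for demonstration only.

```python
import numpy as np

# Hypothetical phoneme inventory; the paper's actual phone set is not reproduced here.
PHONEMES = ["a", "i", "u", "e", "o", "k", "s", "t", "n", "sil"]
PHONE_TO_ID = {p: i for i, p in enumerate(PHONEMES)}
EMB_DIM = 8

# Stand-in for a learned phoneme representation (fixed seed for reproducibility).
rng = np.random.default_rng(0)
PHONE_TABLE = rng.normal(size=(len(PHONEMES), EMB_DIM))

def rhythm_embedding(phonemes, durations_ms):
    """Pool per-phoneme vectors, weighted by duration, into one unit vector.

    Duration weighting is how rhythm information enters the embedding in
    this toy version; the paper instead learns the mapping end to end with
    a speaker identification objective.
    """
    vecs = np.stack([PHONE_TABLE[PHONE_TO_ID[p]] for p in phonemes])
    weights = np.asarray(durations_ms, dtype=float)
    weights = weights / weights.sum()              # durations -> pooling weights
    pooled = (vecs * weights[:, None]).sum(axis=0)  # duration-weighted mean pool
    return pooled / np.linalg.norm(pooled)          # unit-normalize, d-vector style

emb = rhythm_embedding(["k", "a", "s", "a"], [60, 110, 80, 140])
print(emb.shape)  # (8,)
```

Two utterances of the same text spoken with different timing yield different weights and hence different embeddings, which is the property the paper exploits to capture speaker-specific rhythm.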
Pages: 93-104 (12 pages)
Related papers (50 records)
  • [31] GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis
    Yang, Jinhyeok
    Bae, Jae-Sung
    Bak, Taejun
    Kim, Young-Ik
    Cho, Hoon-Young
    [J]. INTERSPEECH 2021, 2021, : 2202 - 2206
  • [32] J-MAC: Japanese multi-speaker audiobook corpus for speech synthesis
    Takamichi, Shinnosuke
    Nakata, Wataru
    Tanji, Naoko
    Saruwatari, Hiroshi
    [J]. INTERSPEECH 2022, 2022, : 2358 - 2362
  • [34] Emotional Speech Synthesis for Multi-Speaker Emotional Dataset Using WaveNet Vocoder
    Choi, Heejin
    Park, Sangjun
    Park, Jinuk
    Hahn, Minsoo
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS (ICCE), 2019,
  • [35] An emotional speech synthesis markup language processor for multi-speaker and emotional text-to-speech applications
    Ryu, Se-Hui
    Cho, Hee
    Lee, Ju-Hyun
    Hong, Ki-Hyung
    [J]. JOURNAL OF THE ACOUSTICAL SOCIETY OF KOREA, 2021, 40 (05): : 523 - 529
  • [36] Zero-Shot Normalization Driven Multi-Speaker Text to Speech Synthesis
    Kumar, Neeraj
    Narang, Ankur
    Lall, Brejesh
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1679 - 1693
  • [37] LCMV BEAMFORMING WITH SUBSPACE PROJECTION FOR MULTI-SPEAKER SPEECH ENHANCEMENT
    Hassani, Amin
    Bertrand, Alexander
    Moonen, Marc
    [J]. 2016 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING PROCEEDINGS, 2016, : 91 - 95
  • [38] End-to-End Multilingual Multi-Speaker Speech Recognition
    Seki, Hiroshi
    Hori, Takaaki
    Watanabe, Shinji
    Le Roux, Jonathan
    Hershey, John R.
    [J]. INTERSPEECH 2019, 2019, : 3755 - 3759
  • [39] END-TO-END MULTI-SPEAKER SPEECH RECOGNITION WITH TRANSFORMER
    Chang, Xuankai
    Zhang, Wangyou
    Qian, Yanmin
    Le Roux, Jonathan
    Watanabe, Shinji
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6134 - 6138
  • [40] Sparse Component Analysis for Speech Recognition in Multi-Speaker Environment
    Asaei, Afsaneh
    Bourlard, Herve
    Garner, Philip N.
    [J]. 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 1704 - 1707