Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis

Cited: 0
Authors
Fujita, Kenichi [1]
Ando, Atsushi [1]
Ijima, Yusuke [1]
Affiliations
[1] NTT Corp, NTT Human Informat Labs, Yokosuka 2390847, Japan
Keywords
speaker embedding; phoneme duration; speech synthesis; speech rhythm;
DOI
10.1587/transinf.2023EDP7039
Chinese Library Classification: TP [Automation Technology, Computer Technology]
Discipline code: 0812
Abstract
This paper proposes a speech rhythm-based method for speaker embeddings to model phoneme duration using a few utterances by the target speaker. Speech rhythm is one of the essential factors among speaker characteristics, along with acoustic features such as F0, for reproducing individual utterances in speech synthesis. A novel feature of the proposed method is the rhythm-based embeddings extracted from phonemes and their durations, which are known to be related to speaking rhythm. They are extracted with a speaker identification model similar to the conventional spectral feature-based one. We conducted three experiments to evaluate the performance: speaker embedding generation, speech synthesis with generated embeddings, and embedding space analysis. The proposed method demonstrated moderate speaker identification performance (15.2% EER), even with only phonemes and their duration information. The objective and subjective evaluation results demonstrated that the proposed method can synthesize speech with a speech rhythm closer to the target speaker's than the conventional method. We also visualized the embeddings to evaluate the relationship between the distance of the embeddings and the perceptual similarity. The visualization of the embedding space and the analysis of embedding closeness indicated that the distribution of embeddings reflects the subjective and objective similarity.
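The core idea of the abstract (mapping a sequence of phonemes and their durations to a fixed-length speaker embedding) can be illustrated with a minimal sketch. Note this is a toy illustration, not the paper's method: the paper trains a speaker identification model to produce embeddings, whereas the `phoneme_vector`, `rhythm_embedding`, and `cosine` helpers below, the phoneme inventory, and the duration-weighted mean pooling are all illustrative assumptions.

```python
import math

# Hypothetical toy phoneme inventory (the paper uses a full phoneme set).
PHONEMES = ["a", "i", "u", "e", "o", "k", "s", "t", "n", "h", "m", "r", "w", "N", "pau"]
DIM = 8  # illustrative embedding dimensionality

def phoneme_vector(ph: str) -> list[float]:
    """Deterministic unit vector per phoneme (a stand-in for learned weights)."""
    seed = PHONEMES.index(ph) + 1
    vec = [math.sin(seed * (j + 1) * 0.7) for j in range(DIM)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def rhythm_embedding(seq: list[tuple[str, float]]) -> list[float]:
    """Duration-weighted mean pool over phoneme vectors, L2-normalized.

    `seq` is a list of (phoneme, duration_in_seconds) pairs, so the same
    phoneme string spoken with different timing yields a different embedding.
    """
    acc = [0.0] * DIM
    total = 0.0
    for ph, dur in seq:
        vec = phoneme_vector(ph)
        for j in range(DIM):
            acc[j] += dur * vec[j]
        total += dur
    pooled = [v / total for v in acc]
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two L2-normalized embeddings."""
    return sum(x * y for x, y in zip(a, b))

# The same phoneme sequence with different rhythms gives distinct embeddings.
fast = rhythm_embedding([("k", 0.05), ("o", 0.08), ("N", 0.06)])
slow = rhythm_embedding([("k", 0.12), ("o", 0.30), ("N", 0.10)])
print(cosine(fast, slow))
```

In the actual paper, the embedding network is trained with a speaker identification objective, so embeddings of the same speaker cluster together; the fixed pooling above merely shows why duration information alone can separate utterances with different timing.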
Pages: 93-104
Page count: 12
Related Papers
50 items in total
  • [41] Lip2Speech: Lightweight Multi-Speaker Speech Reconstruction with Gabor Features
    Dong, Zhongping
    Xu, Yan
    Abel, Andrew
    Wang, Dong
    [J]. APPLIED SCIENCES-BASEL, 2024, 14 (02):
  • [42] The Effects of Phoneme Errors in Speaker Adaptation for HMM Speech Synthesis
    Toth, Balint
    Fegyo, Tibor
    Nemeth, Geza
    [J]. 12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011, : 2816 - +
  • [43] Gender-Dependent Babble Maskers Created from Multi-Speaker Speech for Speech Privacy Protection
    Kondo, Kazuhiro
    Sakurai, Hiroki
    [J]. 2014 TENTH INTERNATIONAL CONFERENCE ON INTELLIGENT INFORMATION HIDING AND MULTIMEDIA SIGNAL PROCESSING (IIH-MSP 2014), 2014, : 251 - 254
  • [44] Silent versus modal multi-speaker speech recognition from ultrasound and video
    Ribeiro, Manuel Sam
    Eshky, Aciel
    Richmond, Korin
    Renals, Steve
    [J]. INTERSPEECH 2021, 2021, : 641 - 645
  • [45] MULTI-SPEAKER SEQUENCE-TO-SEQUENCE SPEECH SYNTHESIS FOR DATA AUGMENTATION IN ACOUSTIC-TO-WORD SPEECH RECOGNITION
    Ueno, Sei
    Mimura, Masato
    Sakai, Shinsuke
    Kawahara, Tatsuya
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6161 - 6165
  • [46] Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation
    Tu, Tao
    Chen, Yuan-Jui
    Liu, Alexander H.
    Lee, Hung-yi
    [J]. INTERSPEECH 2020, 2020, : 3191 - 3195
  • [47] Cross-lingual multi-speaker speech synthesis with limited bilingual training data
    Cai, Zexin
    Yang, Yaogen
    Li, Ming
    [J]. COMPUTER SPEECH AND LANGUAGE, 2023, 77
  • [48] Multi-Speaker Modeling with Shared Prior Distributions and Model Structures for Bayesian Speech Synthesis
    Hashimoto, Kei
    Nankaku, Yoshihiko
    Tokuda, Keiichi
    [J]. 12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011, : 120 - 123
  • [49] MIMO-SPEECH: END-TO-END MULTI-CHANNEL MULTI-SPEAKER SPEECH RECOGNITION
    Chang, Xuankai
    Zhang, Wangyou
    Qian, Yanmin
    Le Roux, Jonathan
    Watanabe, Shinji
    [J]. 2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 237 - 244
  • [50] INVESTIGATING ON INCORPORATING PRETRAINED AND LEARNABLE SPEAKER REPRESENTATIONS FOR MULTI-SPEAKER MULTI-STYLE TEXT-TO-SPEECH
    Chien, Chung-Ming
    Lin, Jheng-Hao
    Huang, Chien-yu
    Hsu, Po-chun
    Lee, Hung-yi
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 8588 - 8592