Phoneme Duration Modeling Using Speech Rhythm-Based Speaker Embeddings for Multi-Speaker Speech Synthesis

Cited by: 3
Authors
Fujita, Kenichi [1 ]
Ando, Atsushi [1 ]
Ijima, Yusuke [1 ]
Affiliation
[1] NTT Corp, Tokyo, Japan
Source
INTERSPEECH 2021
Keywords
speaker embedding; phoneme duration; speech synthesis; speaking rhythm
DOI
10.21437/Interspeech.2021-826
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
This paper proposes a novel speech-rhythm-based method for speaker embeddings. Conventionally, spectral-feature-based speaker embedding vectors such as the x-vector are used as auxiliary information for multi-speaker speech synthesis. However, speech synthesis with conventional embeddings has difficulty reproducing the target speaker's speech rhythm, one of the important factors in speaker characteristics, because spectral features do not explicitly encode it. In this paper, speaker embeddings that take speech rhythm into account are introduced to achieve phoneme duration modeling from only a few utterances by the target speaker. A novel point of the proposed method is that the rhythm-based embeddings are extracted from phonemes and their durations, using a speaker identification model analogous to the conventional spectral-feature-based one. We conducted two experiments: speaker embedding generation and speech synthesis with the generated embeddings. We show that the proposed model achieves an equal error rate (EER) of 10.3% in speaker identification from speech rhythm alone. Visualizing the embeddings shows that utterances with similar rhythms also lie close together in the embedding space. The results of objective and subjective evaluations of speech synthesis demonstrate that the proposed method can synthesize speech whose rhythm is closer to that of the target speaker.
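The abstract describes embeddings derived solely from phoneme identities and phoneme durations, learned through a speaker identification objective. As an illustration of that idea only (not the authors' implementation; the network shape, layer sizes, and names such as RhythmSpeakerEncoder are assumptions), a minimal PyTorch sketch:

import torch
import torch.nn as nn

class RhythmSpeakerEncoder(nn.Module):
    """Toy rhythm-based speaker embedding extractor (illustrative only).

    Inputs are phoneme IDs and per-phoneme durations; the model is
    trained as a speaker classifier, and the bottleneck activation is
    taken as the speech-rhythm speaker embedding.
    """

    def __init__(self, n_phonemes=40, n_speakers=100, emb_dim=64, hidden=128):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, 32)
        # Duration enters as one scalar per phoneme, concatenated with the phoneme embedding.
        self.rnn = nn.LSTM(32 + 1, hidden, batch_first=True, bidirectional=True)
        self.bottleneck = nn.Linear(2 * hidden, emb_dim)  # embedding layer
        self.classifier = nn.Linear(emb_dim, n_speakers)  # training-time head only

    def forward(self, phonemes, durations):
        # phonemes: (batch, T) int64; durations: (batch, T) float
        x = torch.cat([self.phoneme_emb(phonemes), durations.unsqueeze(-1)], dim=-1)
        h, _ = self.rnn(x)                     # (batch, T, 2 * hidden)
        pooled = h.mean(dim=1)                 # average over the phoneme sequence
        emb = torch.tanh(self.bottleneck(pooled))
        return emb, self.classifier(emb)

# Smoke test with random data.
model = RhythmSpeakerEncoder()
phonemes = torch.randint(0, 40, (2, 20))
durations = torch.rand(2, 20)
emb, logits = model(phonemes, durations)       # emb: (2, 64), logits: (2, 100)

After training with cross-entropy against speaker labels, emb would play the role the abstract assigns to the rhythm-based embedding: an auxiliary input to the phoneme duration model of the multi-speaker TTS system.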
Pages: 3141-3145 (5 pages)
Related Papers (50 in total; 10 shown)
  • [1] Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis
    Fujita, Kenichi
    Ando, Atsushi
    Ijima, Yusuke
    IEICE Transactions on Information and Systems, 2024, E107D(1): 93-104
  • [2] Phoneme Dependent Speaker Embedding and Model Factorization for Multi-Speaker Speech Synthesis and Adaptation
    Fu, Ruibo
    Tao, Jianhua
    Wen, Zhengqi
    Zheng, Yibin
    2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019: 6930-6934
  • [3] Unsupervised Discovery of Phoneme Boundaries in Multi-Speaker Continuous Speech
    Armstrong, Tom
    Antetomaso, Stephanie
    2011 IEEE International Conference on Development and Learning (ICDL), 2011
  • [4] End-to-End Multi-Speaker Speech Recognition using Speaker Embeddings and Transfer Learning
    Denisov, Pavel
    Vu, Ngoc Thang
    INTERSPEECH 2019, 2019: 4425-4429
  • [5] Multi-Speaker Emotional Acoustic Modeling for CNN-Based Speech Synthesis
    Choi, Heejin
    Park, Sangjun
    Park, Jinuk
    Hahn, Minsoo
    2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019: 6950-6954
  • [6] Deep Gaussian process based multi-speaker speech synthesis with latent speaker representation
    Mitsui, Kentaro
    Koriyama, Tomoki
    Saruwatari, Hiroshi
    Speech Communication, 2021, 132: 132-145
  • [7] DNN based multi-speaker speech synthesis with temporal auxiliary speaker ID embedding
    Lee, Junmo
    Song, Kwangsub
    Noh, Kyoungjin
    Park, Tae-Jun
    Chang, Joon-Hyuk
    2019 International Conference on Electronics, Information, and Communication (ICEIC), 2019: 61-64
  • [8] Speaker Clustering with Penalty Distance for Speaker Verification with Multi-Speaker Speech
    Das, Rohan Kumar
    Yang, Jichen
    Li, Haizhou
    2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2019: 1630-1635
  • [9] Multi-Speaker Emotional Speech Synthesis with Fine-Grained Prosody Modeling
    Lu, Chunhui
    Wen, Xue
    Liu, Ruolan
    Chen, Xiao
    2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021), 2021: 5729-5733
  • [10] Cross-lingual, Multi-speaker Text-To-Speech Synthesis Using Neural Speaker Embedding
    Chen, Mengnan
    Chen, Minchuan
    Liang, Shuang
    Ma, Jun
    Chen, Lei
    Wang, Shaojun
    Xiao, Jing
    INTERSPEECH 2019, 2019: 2105-2109