PHONEME DEPENDENT SPEAKER EMBEDDING AND MODEL FACTORIZATION FOR MULTI-SPEAKER SPEECH SYNTHESIS AND ADAPTATION

Citations: 0
Authors
Fu, Ruibo [1, 2]
Tao, Jianhua [1, 2, 3]
Wen, Zhengqi [1]
Zheng, Yibin [1, 2]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
[3] CAS Ctr Excellence Brain Sci & Intelligence Techn, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
speech synthesis; speaker adaptation; speaker embedding; phoneme representation; SYNTHESIS SYSTEM; NEURAL-NETWORKS;
DOI
Not available
Chinese Library Classification
O42 [Acoustics];
Discipline classification codes
070206; 082403;
Abstract
This paper presents an architecture for speaker adaptation in a long short-term memory (LSTM) based Mandarin statistical parametric speech synthesis system. Conventional methods use fixed, global, utterance-level speaker representations designed for the speaker recognition task. In contrast, the proposed method extracts speaker representations at both the utterance level and the phoneme level, so that it can capture more phoneme-level pronunciation characteristics. An attention mechanism dynamically combines the representations from each level to train a task-specific, phoneme-dependent speaker embedding. To handle the unbalanced database and avoid over-fitting, the model is factorized into an average model and an adaptation model, which are combined by an attention mechanism. We investigate the performance of speaker representations extracted by different methods. Experimental results confirm the adaptability of the proposed speaker embedding and model factorization structure, and listening tests show that the proposed method achieves better adaptation performance than the baselines in terms of naturalness and speaker similarity.
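The abstract does not specify how the attention mechanism is parameterized, so the following is only a minimal sketch of the general idea of combining an utterance-level and a phoneme-level speaker vector with attention weights. It assumes scaled dot-product scoring against a hypothetical per-phoneme query vector; the function names, the toy vectors, and the scoring rule are all illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def combine_speaker_levels(
    query: Sequence[float],
    level_vectors: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Attention-weighted sum of speaker vectors from several levels.

    ASSUMPTION: scaled dot-product scoring is used here for illustration;
    the paper's actual scoring function is not given in the abstract.
    """
    scale = math.sqrt(len(query))
    scores = [dot(query, vec) / scale for vec in level_vectors]
    weights = softmax(scores)
    dim = len(level_vectors[0])
    combined = [
        sum(w * vec[i] for w, vec in zip(weights, level_vectors))
        for i in range(dim)
    ]
    return combined, weights


# Hypothetical toy inputs (not data from the paper).
utterance_vec = [0.2, 0.1, 0.4, 0.3]   # one global vector per utterance
phoneme_vec = [0.9, -0.2, 0.1, 0.5]    # vector for the current phoneme
phoneme_query = [1.0, 0.0, 0.0, 1.0]   # hypothetical query for this phoneme

embedding, weights = combine_speaker_levels(
    phoneme_query, [utterance_vec, phoneme_vec]
)
```

Because the weights are recomputed for every phoneme, the resulting embedding can lean on the utterance-level vector for some phonemes and on the phoneme-level vector for others, which is the sense in which such an embedding is phoneme dependent.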
Pages: 6930-6934
Page count: 5