DNN based multi-speaker speech synthesis with temporal auxiliary speaker ID embedding

Cited: 1
Authors
Lee, Junmo [1 ]
Song, Kwangsub [1 ]
Noh, Kyoungjin [1 ]
Park, Tae-Jun [1 ]
Chang, Joon-Hyuk [1 ]
Affiliations
[1] Hanyang Univ, Dept Elect Engn, Seoul, South Korea
Keywords
deep learning; sequence-to-sequence; speech synthesis; multi-speaker speech synthesis
DOI
10.23919/elinfocom.2019.8706390
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
In this paper, multi-speaker speech synthesis using a speaker embedding is proposed. The proposed model is based on the Tacotron network, but its post-processing network is modified with dilated convolution layers, as used in the WaveNet architecture, to make it better suited to speech. The model can generate multiple speakers' voices with a single neural network by feeding an auxiliary input, the speaker embedding, to the network. The model successfully generates two speakers' voices without significant deterioration of speech quality.
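The abstract describes conditioning a single Tacotron-style network on an auxiliary speaker-ID embedding supplied alongside the acoustic features at every time step. A minimal NumPy sketch of that conditioning step follows; all dimensions, names, and the lookup-table initialization are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 2 speakers, 16-dim speaker embedding,
# 80-dim mel-spectrogram frames, 100 time steps.
num_speakers, emb_dim, mel_dim, T = 2, 16, 80, 100

# Trainable lookup table mapping each speaker ID to an embedding vector
# (randomly initialized here; learned jointly with the network in practice).
speaker_table = rng.standard_normal((num_speakers, emb_dim))

def condition_on_speaker(mel_frames: np.ndarray, speaker_id: int) -> np.ndarray:
    """Broadcast the speaker embedding over time and concatenate it to
    every frame, giving the network a temporal auxiliary speaker input."""
    emb = speaker_table[speaker_id]                   # (emb_dim,)
    emb_seq = np.tile(emb, (mel_frames.shape[0], 1))  # (T, emb_dim)
    return np.concatenate([mel_frames, emb_seq], axis=-1)  # (T, mel_dim + emb_dim)

frames = rng.standard_normal((T, mel_dim))
conditioned = condition_on_speaker(frames, speaker_id=1)
print(conditioned.shape)  # (100, 96)
```

Because the same network weights serve every speaker, switching the `speaker_id` index is all that is needed to change the output voice, which is how a single model covers multiple speakers.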
Pages: 61-64 (4 pages)
Related Papers
50 records
  • [1] MULTI-SPEAKER MODELING AND SPEAKER ADAPTATION FOR DNN-BASED TTS SYNTHESIS
    Fan, Yuchen
    Qian, Yao
    Soong, Frank K.
    He, Lei
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 4475 - 4479
  • [2] PHONEME DEPENDENT SPEAKER EMBEDDING AND MODEL FACTORIZATION FOR MULTI-SPEAKER SPEECH SYNTHESIS AND ADAPTATION
    Fu, Ruibo
    Tao, Jianhua
    Wen, Zhengqi
    Zheng, Yibin
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6930 - 6934
  • [3] Cross-lingual, Multi-speaker Text-To-Speech Synthesis Using Neural Speaker Embedding
    Chen, Mengnan
    Chen, Minchuan
    Liang, Shuang
    Ma, Jun
    Chen, Lei
    Wang, Shaojun
    Xiao, Jing
    [J]. INTERSPEECH 2019, 2019, : 2105 - 2109
  • [4] Deep Gaussian process based multi-speaker speech synthesis with latent speaker representation
    Mitsui, Kentaro
    Koriyama, Tomoki
    Saruwatari, Hiroshi
    [J]. SPEECH COMMUNICATION, 2021, 132 : 132 - 145
  • [5] Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS
    Udagawa, Kenta
    Saito, Yuki
    Saruwatari, Hiroshi
    [J]. INTERSPEECH 2022, 2022, : 2968 - 2972
  • [6] Speaker Clustering with Penalty Distance for Speaker Verification with Multi-Speaker Speech
    Das, Rohan Kumar
    Yang, Jichen
    Li, Haizhou
    [J]. 2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 1630 - 1635
  • [7] Phoneme Duration Modeling Using Speech Rhythm-Based Speaker Embeddings for Multi-Speaker Speech Synthesis
    Fujita, Kenichi
    Ando, Atsushi
    Ijima, Yusuke
    [J]. INTERSPEECH 2021, 2021, : 3141 - 3145
  • [8] Training Speaker Embedding Extractors Using Multi-Speaker Audio with Unknown Speaker Boundaries
    Stafylakis, Themos
    Mosner, Ladislav
    Plchot, Oldrich
    Rohdin, Johan
    Silnova, Anna
    Burget, Lukas
    Cernocky, Jan Honza
    [J]. INTERSPEECH 2022, 2022, : 605 - 609
  • [9] Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis
    Fujita, Kenichi
    Ando, Atsushi
    Ijima, Yusuke
    [J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2024, E107D (01) : 93 - 104
  • [10] Autoregressive multi-speaker model in Chinese speech synthesis based on variational autoencoder
    Hao, Xiaoyang
    Zhang, Pengyuan
    [J]. Shengxue Xuebao/Acta Acustica, 2022, 47 (03) : 405 - 416