Unsupervised Speaker and Expression Factorization for Multi-Speaker Expressive Synthesis of Ebooks

Cited by: 0
Authors
Chen, Langzhou [1 ]
Braunschweiler, Norbert [1 ]
Institutions
[1] Toshiba Res Europe Ltd, Cambridge, England
Keywords
expressive speech synthesis; hidden Markov model; cluster adaptive training; factorization; audiobook
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
This work aims to improve expressive speech synthesis of ebooks for multiple speakers by using training data from many audiobooks. Audiobooks contain a wide variety of expressive speaking styles which are often impractical to annotate. However, the speaker-expression factorization (SEF) framework, which has proven to be a powerful tool for speaker and expression modelling, usually requires supervised expression labels in the training data. This work presents an unsupervised SEF method which performs SEF on unlabelled training data within the framework of cluster adaptive training (CAT). The proposed method integrates expression clustering and parameter estimation into a single process that maximizes the likelihood of the training data. Experimental results indicate that it outperforms a cascade system of expression clustering followed by supervised SEF, and significantly improves the expressiveness of the synthetic speech of different speakers.
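The key idea in the abstract is joint optimization: rather than clustering expressions first and then estimating model parameters, both steps alternate inside one likelihood-maximization loop. The toy sketch below illustrates that joint scheme as hard-EM over one-dimensional "utterance features"; it is an illustration of the clustering-plus-estimation loop only, not the paper's actual CAT/HMM implementation, and the function name and data representation are hypothetical.

```python
import random


def hard_em_cluster(utts, k, iters=20, seed=0):
    """Toy hard-EM: jointly assign utterances to expression clusters and
    re-estimate cluster means, locally maximizing training-data likelihood.
    Utterances are scalar features here purely for illustration; the paper
    instead estimates CAT cluster parameters for HMM-based synthesis."""
    rng = random.Random(seed)
    means = rng.sample(utts, k)  # initialize cluster means from the data
    for _ in range(iters):
        # E-step (hard): assign each utterance to its best-scoring cluster
        assign = [min(range(k), key=lambda c: (u - means[c]) ** 2)
                  for u in utts]
        # M-step: re-estimate each cluster mean from its assigned utterances
        for c in range(k):
            members = [u for u, a in zip(utts, assign) if a == c]
            if members:
                means[c] = sum(members) / len(members)
    return means, assign


# Two well-separated expression "styles" are recovered without labels.
means, assign = hard_em_cluster([0.0, 0.1, 0.2, 5.0, 5.1, 5.2], k=2)
```

Because the assignment step and the estimation step optimize the same objective, the loop cannot decrease the likelihood, which is the property the cascade baseline (cluster once, then train) lacks.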
Pages: 1041-1045 (5 pages)
Related Papers
50 in total
  • [41] Xue, Jinlong; Deng, Yayue; Han, Yichen; Li, Ya; Sun, Jianqing; Liang, Jiaen. ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis. 13th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2022: 230-234.
  • [42] Fujita, Kenichi; Ando, Atsushi; Ijima, Yusuke. Phoneme Duration Modeling Using Speech Rhythm-Based Speaker Embeddings for Multi-Speaker Speech Synthesis. Interspeech, 2021: 3141-3145.
  • [43] Khalman, Misha; Zhao, Yao; Saleh, Mohammad. ForumSum: A Multi-Speaker Conversation Summarization Dataset. Findings of the Association for Computational Linguistics: EMNLP, 2021: 4592-4599.
  • [44] Yousefi, Midia; Hansen, John H. L. Speaker Conditioning of Acoustic Models Using Affine Transformation for Multi-Speaker Speech Recognition. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021: 283-288.
  • [45] Mirrezaie, S. M.; Ahadi, S. M. Speaker Diarization in a Multi-Speaker Environment Using Particle Swarm Optimization and Mutual Information. IEEE International Conference on Multimedia and Expo, 2008: 1533-1536.
  • [46] Hiroya, Sadao; Mochida, Takemi. Multi-speaker articulatory trajectory formation based on speaker-independent articulatory HMMs. Speech Communication, 2006, 48(12): 1677-1690.
  • [47] Udagawa, Kenta; Saito, Yuki; Saruwatari, Hiroshi. Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS. Interspeech, 2022: 2968-2972.
  • [48] Shechtman, Slava; Fernandez, Raul; Sorin, Alexander; Haws, David. Synthesis of expressive speaking styles with limited training data in a multi-speaker, prosody-controllable sequence-to-sequence architecture. Interspeech, 2021: 4693-4697.
  • [49] Fujita, Kenichi; Ando, Atsushi; Ijima, Yusuke. Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis. IEICE Transactions on Information and Systems, 2024, E107D(1): 93-104.
  • [50] Mitsui, Kentaro; Koriyama, Tomoki; Saruwatari, Hiroshi. Multi-speaker Text-to-speech Synthesis Using Deep Gaussian Processes. Interspeech, 2020: 2032-2036.