Low-Resource Emotional Speech Synthesis: Transfer Learning and Data Requirements

Cited by: 0
Authors
Nesterenko, Anton [1 ,2 ]
Akhmerov, Ruslan [2 ]
Matveeva, Yulia [3 ]
Goremykina, Anna [2 ]
Astankov, Dmitry [2 ]
Shuranov, Evgeniy [4 ]
Shirshova, Alexandra [3 ]
Affiliations
[1] Ivanovo State Univ Chem & Technol, Ivanovo, Russia
[2] Big Data Acad MADE VK, St Petersburg, Russia
[3] Huawei St Petersburg Res Ctr, St Petersburg, Russia
[4] ITMO Univ, St Petersburg, Russia
Keywords
Emotional speech synthesis; Expressive speech synthesis; Data requirements; Low-resource text-to-speech; Adversarial training; Transfer learning from speaker verification;
DOI
10.1007/978-3-031-20980-2_43
Chinese Library Classification
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
Recently, a number of solutions have been proposed that improve the way an emotional aspect is added to speech synthesis. Combined with core neural text-to-speech architectures that reach high naturalness scores, these models can produce natural, human-like speech with clearly discernible emotions and can even model their intensities. To synthesize emotions successfully, such models are trained on hours of emotional data. In practice, however, collecting large amounts of emotional speech per speaker is difficult and rather expensive. In this article, we investigate the minimal data requirements for expressive text-to-speech solutions to be applicable in practical scenarios, and we also search for an optimal architecture for low-resource training. In particular, we vary the number of training speakers and the amount of data per emotion. We focus on the frequently occurring situation in which a large multi-speaker dataset of neutral recordings and a large single-speaker emotional dataset are available, but little emotional data exists for the remaining speakers. On top of that, we study the effect of several architecture modifications and training procedures (namely, adversarial training and transfer learning from speaker verification) on the quality of the models as well as on their data requirements. Our results show that transfer learning can lower data requirements from 15 min per speaker per emotion to just 2.5-7 min, with no significant change in voice naturalness and with high emotion recognition rates. We also show how the data requirements change from one emotion to another. A demo page illustrating the main findings of this work is available at: https://diparty.github.io/projects/tts/emo/nat.
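The transfer-learning setup the abstract alludes to can be sketched as follows. This is a hypothetical, toy illustration only (the paper's actual architecture is not reproduced here): a speaker-verification encoder, pretrained on a large neutral multi-speaker corpus and then frozen, supplies an utterance-level speaker embedding that, together with an emotion code, conditions the synthesizer trained on the few minutes of emotional data per speaker. All names (`speaker_encoder`, `condition_decoder_input`) and dimensions are illustrative assumptions.

```python
# Toy sketch: conditioning a synthesizer on a frozen speaker-verification
# embedding plus a one-hot emotion code. Real d-vector embeddings are
# typically 256+ dimensional; we use 4 dims to keep the example readable.
import math
import random

random.seed(0)

EMB_DIM = 4  # toy embedding size


def speaker_encoder(frames):
    """Stand-in for a pretrained, frozen speaker-verification encoder:
    mean-pool the frame-level features, then L2-normalize the result
    (as d-vector systems commonly do)."""
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]


def condition_decoder_input(phoneme_feats, spk_emb, emo_id, n_emotions=4):
    """Concatenate the frozen speaker embedding and a one-hot emotion code
    to every phoneme-level feature vector -- a common conditioning scheme
    for multi-speaker, multi-emotion TTS."""
    emo_onehot = [1.0 if i == emo_id else 0.0 for i in range(n_emotions)]
    return [f + spk_emb + emo_onehot for f in phoneme_feats]


# Toy usage: 10 frames of 4-dim "acoustic features", 3 phoneme vectors of dim 2.
frames = [[random.random() for _ in range(EMB_DIM)] for _ in range(10)]
emb = speaker_encoder(frames)
cond = condition_decoder_input([[0.1, 0.2]] * 3, emb, emo_id=1)
print(len(cond), len(cond[0]))  # 3 conditioned vectors, each 2 + 4 + 4 = 10 dims
```

Because the speaker encoder is reused rather than learned from scratch, the emotional fine-tuning stage only has to learn the emotion-dependent mapping, which is one intuition for why the per-speaker data budget can drop as reported.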
Pages: 508-521
Page count: 14