Improving Performance of Seen and Unseen Speech Style Transfer in End-to-end Neural TTS

Cited by: 2
Authors
An, Xiaochun [1 ]
Soong, Frank K. [2 ]
Xie, Lei [1 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Audio Speech & Language Proc Grp ASLP NPU, Xian, Peoples R China
[2] Microsoft China, Beijing, Peoples R China
Source
INTERSPEECH 2021
Keywords
neural TTS; style transfer; style distortion; cycle consistency; disjoint datasets;
DOI
10.21437/Interspeech.2021-1407
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Codes
100104 ; 100213 ;
Abstract
End-to-end neural TTS training has shown improved performance in speech style transfer. However, the improvement is still limited by the training data coverage of both target styles and speakers. Style transfer performance degrades when the trained TTS tries to transfer speech to a target style from a new speaker with an unknown, arbitrary style. In this paper, we propose a new approach to style transfer for both seen and unseen styles, using disjoint, multi-style datasets, i.e., datasets of different styles in which each individual style is recorded by one speaker with multiple utterances. To encode the style information, we adopt an inverse autoregressive flow (IAF) structure to improve the variational inference. The whole system is optimized to minimize a weighted sum of four different loss functions: 1) a reconstruction loss to measure the distortions in both source and target reconstructions; 2) an adversarial loss to "fool" a well-trained discriminator; 3) a style distortion loss to measure the expected style loss after the transfer; 4) a cycle consistency loss to preserve the speaker identity of the source after the transfer. Experiments demonstrate, both objectively and subjectively, the effectiveness of the proposed approach on seen and unseen style transfer tasks. The new approach outperforms four prior-art baseline systems and is more robust.
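The training objective described in the abstract can be sketched as a weighted sum of the four component losses. The following is a minimal illustration only; the function name, argument names, and default weights are assumptions for exposition and do not come from the authors' implementation.

```python
def total_loss(l_recon, l_adv, l_style, l_cycle,
               w_recon=1.0, w_adv=0.5, w_style=0.5, w_cycle=1.0):
    """Weighted sum of the four objectives named in the abstract:
    reconstruction, adversarial, style distortion, and cycle consistency.
    The weights here are illustrative placeholders, not the paper's values."""
    return (w_recon * l_recon
            + w_adv * l_adv
            + w_style * l_style
            + w_cycle * l_cycle)

# With unit weights the objective reduces to a plain sum of the four terms.
print(total_loss(0.2, 0.1, 0.3, 0.4, 1.0, 1.0, 1.0, 1.0))
```

In practice each term would be computed by its own sub-network (e.g., the adversarial term from a discriminator's output), and the weights tuned on a development set.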
Pages: 4688-4692 (5 pages)