Improving Performance of Seen and Unseen Speech Style Transfer in End-to-end Neural TTS

Cited by: 2
Authors
An, Xiaochun [1]
Soong, Frank K. [2]
Xie, Lei [1]
Affiliations
[1] Northwestern Polytechnical University, School of Computer Science, Audio, Speech and Language Processing Group (ASLP@NPU), Xi'an, People's Republic of China
[2] Microsoft China, Beijing, People's Republic of China
Keywords
neural TTS; style transfer; style distortion; cycle consistency; disjoint datasets
DOI
10.21437/Interspeech.2021-1407
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
End-to-end neural TTS training has shown improved performance in speech style transfer. However, the improvement is still limited by the training data in both target styles and speakers. Inadequate style transfer performance occurs when the trained TTS tries to transfer the speech to a target style from a new speaker with an unknown, arbitrary style. In this paper, we propose a new approach to style transfer for both seen and unseen styles, with disjoint, multi-style datasets, i.e., datasets of different styles are recorded, with each individual style recorded by one speaker in multiple utterances. To encode the style information, we adopt an inverse autoregressive flow (IAF) structure to improve the variational inference. The whole system is optimized to minimize a weighted sum of four different loss functions: 1) a reconstruction loss to measure the distortions in both source and target reconstructions; 2) an adversarial loss to "fool" a well-trained discriminator; 3) a style distortion loss to measure the expected style loss after the transfer; 4) a cycle consistency loss to preserve the speaker identity of the source after the transfer. Experiments demonstrate, both objectively and subjectively, the effectiveness of the proposed approach for seen and unseen style transfer tasks. The performance of the new approach is better and more robust than that of four prior-art baseline systems.
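The training objective described in the abstract, a weighted sum of four loss terms, can be illustrated with a minimal sketch. The abstract does not state the weight values or how each term is computed, so the `LossWeights` class, its default values of 1.0, and the `total_loss` function below are hypothetical names and assumptions introduced only for illustration; they are not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    """Hypothetical weights; the abstract does not report their values."""
    reconstruction: float = 1.0
    adversarial: float = 1.0
    style_distortion: float = 1.0
    cycle_consistency: float = 1.0


def total_loss(reconstruction: float,
               adversarial: float,
               style_distortion: float,
               cycle_consistency: float,
               weights: LossWeights = LossWeights()) -> float:
    """Combine four already-computed scalar loss terms into one objective.

    Each argument is assumed to be a scalar loss value produced elsewhere
    (for example by a TTS model and a discriminator); this function only
    forms their weighted sum.
    """
    return (weights.reconstruction * reconstruction
            + weights.adversarial * adversarial
            + weights.style_distortion * style_distortion
            + weights.cycle_consistency * cycle_consistency)


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    print(total_loss(1.0, 2.0, 3.0, 4.0))
    print(total_loss(1.0, 2.0, 3.0, 4.0, LossWeights(0.5, 0.0, 2.0, 1.0)))
```

Setting one weight to zero in such a sketch removes the corresponding term, which is one simple way to think about ablating an individual loss.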
Pages: 4688-4692
Number of pages: 5
Related papers
50 in total
  • [41] Towards End-to-End Speech Recognition with Recurrent Neural Networks
    Graves, Alex
    Jaitly, Navdeep
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 32 (CYCLE 2), 2014, 32 : 1764 - 1772
  • [42] END-TO-END NEURAL NETWORK BASED AUTOMATED SPEECH SCORING
    Chen, Lei
    Tao, Jidong
    Ghaffarzadegan, Shabnam
    Qian, Yao
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 6234 - 6238
  • [43] Enhancing Performance on Seen and Unseen Dialogue Scenarios using Retrieval-Augmented End-to-End Task-Oriented System
    Zhang, Jianguo
    Roller, Stephen
    Qian, Kun
    Liu, Zhiwei
    Meng, Rui
    Heinecke, Shelby
    Wang, Huan
    Savarese, Silvio
    Xiong, Caiming
    24TH MEETING OF THE SPECIAL INTEREST GROUP ON DISCOURSE AND DIALOGUE, SIGDIAL 2023, 2023, : 509 - 518
  • [44] Investigation of Transfer Learning for End-to-End Russian Speech Recognition
    Kipyatkova, Irina
    SPEECH AND COMPUTER, SPECOM 2022, 2022, 13721 : 349 - 357
  • [45] INVESTIGATING CONTEXT FEATURES HIDDEN IN END-TO-END TTS
    Mametani, Kohki
    Kato, Tsuneo
    Yamamoto, Seiichi
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6920 - 6924
  • [46] Improving end-to-end performance by active queue management
    Ku, CF
    Chen, SJ
    Ho, JM
    Chang, RI
    AINA 2005: 19th International Conference on Advanced Information Networking and Applications, Vol 2, 2005, : 337 - 340
  • [47] Improving Performance of End-to-End ASR on Numeric Sequences
    Peyser, Cal
    Zhang, Hao
    Sainath, Tara N.
    Wu, Zelin
    INTERSPEECH 2019, 2019, : 2185 - 2189
  • [48] LEARNING HIERARCHICAL REPRESENTATIONS FOR EXPRESSIVE SPEAKING STYLE IN END-TO-END SPEECH SYNTHESIS
    An, Xiaochun
    Wang, Yuxuan
    Yang, Shan
    Ma, Zejun
    Xie, Lei
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 184 - 191
  • [49] PREDICTING EXPRESSIVE SPEAKING STYLE FROM TEXT IN END-TO-END SPEECH SYNTHESIS
    Stanton, Daisy
    Wang, Yuxuan
    Skerry-Ryan, R. J.
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 595 - 602
  • [50] A FAST END-TO-END METHOD WITH STYLE TRANSFER FOR ROOM LAYOUT ESTIMATION
    Chen, Junming
    Shao, Tie
    Zhang, Dongyang
    Wu, Xuehui
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 964 - 969