IMPROVING NATURALNESS AND CONTROLLABILITY OF SEQUENCE-TO-SEQUENCE SPEECH SYNTHESIS BY LEARNING LOCAL PROSODY REPRESENTATIONS

Cited by: 7
Authors
Gong, Cheng [1]
Wang, Longbiao [1]
Ling, Zhenhua [2]
Guo, Shaotong [1]
Zhang, Ju [3]
Dang, Jianwu [1, 4]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin Key Lab Cognit Comp & Applicat, Tianjin, Peoples R China
[2] Univ Sci & Technol China, Natl Engn Lab Speech & Language Informat Proc, Hefei, Peoples R China
[3] Huiyan Technol Tianjin Co Ltd, Tianjin, Peoples R China
[4] Japan Adv Inst Sci & Technol, Nomi, Ishikawa, Japan
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China
Keywords
speech synthesis; pitch prediction; naturalness; pitch control
DOI
10.1109/ICASSP39728.2021.9414720
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
State-of-the-art neural text-to-speech (TTS) networks are trained on large amounts of speech data, which significantly improves the quality of synthetic speech compared with traditional approaches. However, the prosody and controllability of the generated speech are still insufficient, especially in tonal languages. Moreover, the generated prosody is determined solely by the input text, which does not allow different styles for the same sentence or words. In this study, we extended Tacotron2 with a pitch prediction task to capture discrete pitch-related representations. Specifically, the learned pitch-related suprasegmental information is fed into the decoder together with the traditional character features to generate the final mel spectrogram. Experiments show that the proposed method improves the quality of the generated speech (mean opinion score of 4.37 vs. 4.22). Moreover, we demonstrated that word-level pitch control can easily be achieved during generation by changing the local pitch-related representations before passing them to the decoder network.
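The data flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: in the paper the pitch-related representations are learned through a pitch prediction task inside Tacotron2, whereas here a fixed quantizer and one-hot vectors stand in for them. All names (`NUM_PITCH_CLASSES`, `quantize_pitch`, `shift_word_pitch`, `build_decoder_inputs`) and the choice of five pitch levels are assumptions made for illustration only.

```python
from typing import List, Sequence

# Hypothetical number of discrete pitch levels (an assumption, not from the paper).
NUM_PITCH_CLASSES = 5


def clamp(idx: int) -> int:
    """Keep a pitch class ID inside the valid range."""
    return min(max(idx, 0), NUM_PITCH_CLASSES - 1)


def quantize_pitch(f0_hz: Sequence[float], f0_min: float, f0_max: float) -> List[int]:
    """Map one F0 value (Hz) per character to a discrete pitch class ID."""
    width = (f0_max - f0_min) / NUM_PITCH_CLASSES
    return [clamp(int((f0 - f0_min) / width)) for f0 in f0_hz]


def shift_word_pitch(pitch_ids: Sequence[int], start: int, end: int, delta: int) -> List[int]:
    """Raise or lower the pitch classes of the characters in [start, end)."""
    shifted = list(pitch_ids)
    for i in range(start, end):
        shifted[i] = clamp(shifted[i] + delta)
    return shifted


def one_hot(idx: int) -> List[float]:
    """Stand-in for a learned pitch representation."""
    return [1.0 if k == idx else 0.0 for k in range(NUM_PITCH_CLASSES)]


def build_decoder_inputs(char_features: Sequence[Sequence[float]],
                         pitch_ids: Sequence[int]) -> List[List[float]]:
    """Concatenate each character feature vector with its pitch representation."""
    if len(char_features) != len(pitch_ids):
        raise ValueError("need exactly one pitch class per character")
    return [list(c) + one_hot(p) for c, p in zip(char_features, pitch_ids)]


# Toy example: five characters with two-dimensional character features.
char_features = [[0.1, 0.2]] * 5
pitch_ids = quantize_pitch([110.0, 150.0, 190.0, 230.0, 270.0], 100.0, 300.0)
# Word-level control: raise the pitch of the word spanning characters 1 and 2.
controlled = shift_word_pitch(pitch_ids, start=1, end=3, delta=2)
decoder_inputs = build_decoder_inputs(char_features, controlled)
```

Only the characters of the selected word receive a different pitch class, so the rest of the sequence reaches the decoder unchanged. This mirrors the local, word-level control the abstract describes.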
Pages: 5724-5728
Number of pages: 5