Enhancing Word-Level Semantic Representation via Dependency Structure for Expressive Text-to-Speech Synthesis

Cited by: 2
Authors
Zhou, Yixuan [1, 4]
Song, Changhe [1 ]
Li, Jingbei [1 ]
Wu, Zhiyong [1, 2]
Bian, Yanyao [3 ]
Su, Dan [3 ]
Meng, Helen [2 ]
Affiliations
[1] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[3] Tencent, Tencent AI Lab, Shenzhen, Peoples R China
[4] Tencent, Shenzhen, Peoples R China
Source
Funding
National Natural Science Foundation of China
Keywords
expressive speech synthesis; semantic representation enhancing; dependency parsing; graph neural network
DOI
10.21437/Interspeech.2022-10061
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Exploiting the rich linguistic information in raw text is crucial for expressive text-to-speech (TTS) synthesis. With the development of large-scale pre-trained text representations, bidirectional encoder representations from Transformers (BERT) have been shown to embody semantic information and have recently been applied to TTS. However, original or simply fine-tuned BERT embeddings still cannot provide sufficient semantic knowledge for expressive TTS models to take into account. In this paper, we propose a word-level semantic representation enhancing method based on dependency structure and pre-trained BERT embeddings. The BERT embedding of each word is reprocessed in light of its specific dependencies and related words in the sentence, to generate a more effective semantic representation for TTS. To better utilize the dependency structure, a relational gated graph network (RGGN) is introduced so that semantic information flows and aggregates along the dependency structure. Experimental results show that the proposed method further improves the naturalness and expressiveness of synthesized speech on both Mandarin and English datasets.
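The idea of letting word representations flow along dependency arcs can be illustrated with a toy example. The sketch below is not the paper's RGGN: it is a simplified, single-step, relation-aware gated message-passing update written in plain Python. The function names, the `_rev` convention for reverse arcs, the random vectors standing in for word-level BERT embeddings, and the random weights are all assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a dim x dim matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rng, dim, scale=0.1):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(dim)]


def gated_relational_step(h, edges, rel_weights, gate_weights):
    """One propagation step over a dependency graph (illustrative only).

    h            -- list of word vectors (stand-ins for word-level BERT embeddings)
    edges        -- list of (head_index, dependent_index, relation_label)
    rel_weights  -- dict: relation label (and label + "_rev") -> dim x dim matrix
    gate_weights -- pair of dim x dim matrices used by the update gate
    """
    dim = len(h[0])
    messages = [[0.0] * dim for _ in h]
    for head, dep, rel in edges:
        # Message from head to dependent, transformed by the relation's matrix.
        msg = matvec(rel_weights[rel], h[head])
        messages[dep] = [a + b for a, b in zip(messages[dep], msg)]
        # Message in the reverse direction uses a separate matrix.
        msg = matvec(rel_weights[rel + "_rev"], h[dep])
        messages[head] = [a + b for a, b in zip(messages[head], msg)]

    w_msg, w_state = gate_weights
    new_h = []
    for state, msg in zip(h, messages):
        # Update gate decides how much of the aggregated message to admit.
        gate = [sigmoid(a + b)
                for a, b in zip(matvec(w_msg, msg), matvec(w_state, state))]
        candidate = [math.tanh(s + m) for s, m in zip(state, msg)]
        new_h.append([(1.0 - g) * s + g * c
                      for g, s, c in zip(gate, state, candidate)])
    return new_h


# Toy sentence with a hand-written dependency parse (head, dependent, relation).
rng = random.Random(0)
dim = 4
words = ["She", "reads", "books"]
h0 = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in words]
edges = [(1, 0, "nsubj"), (1, 2, "obj")]
rel_weights = {
    name: random_matrix(rng, dim)
    for rel in ("nsubj", "obj")
    for name in (rel, rel + "_rev")
}
gate_weights = (random_matrix(rng, dim), random_matrix(rng, dim))
enhanced = gated_relational_step(h0, edges, rel_weights, gate_weights)
```

Here `enhanced` holds one updated vector per word. In a real system the input vectors would come from a pre-trained BERT model, the arcs from a dependency parser, and the weights from training with the TTS model; repeating the step would let information travel along longer dependency paths.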
Pages: 5518-5522
Page count: 5