Enhancing Word-Level Semantic Representation via Dependency Structure for Expressive Text-to-Speech Synthesis

Cited by: 2
Authors
Zhou, Yixuan [1, 4]
Song, Changhe [1]
Li, Jingbei [1]
Wu, Zhiyong [1, 2]
Bian, Yanyao [3]
Su, Dan [3]
Meng, Helen [2]
Affiliations
[1] Tsinghua Univ, Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] Chinese Univ Hong Kong, Hong Kong, Peoples R China
[3] Tencent, Tencent AI Lab, Shenzhen, Peoples R China
[4] Tencent, Shenzhen, Peoples R China
Source
Funding
National Natural Science Foundation of China
Keywords
expressive speech synthesis; semantic representation enhancing; dependency parsing; graph neural network
DOI
10.21437/Interspeech.2022-10061
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Exploiting the rich linguistic information in raw text is crucial for expressive text-to-speech (TTS) synthesis. With the development of large-scale pre-trained text representations, bidirectional encoder representations from Transformers (BERT) have been shown to embody semantic information and have recently been applied to TTS. However, original or simply fine-tuned BERT embeddings still cannot provide the sufficient semantic knowledge that expressive TTS models should take into account. In this paper, we propose a method for enhancing word-level semantic representations based on dependency structure and pre-trained BERT embeddings. The BERT embedding of each word is reprocessed in light of its specific dependencies and related words in the sentence, to generate a more effective semantic representation for TTS. To better utilize the dependency structure, a relational gated graph network (RGGN) is introduced to let semantic information flow and aggregate through the dependency structure. Experimental results show that the proposed method further improves the naturalness and expressiveness of synthesized speech on both Mandarin and English datasets.
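The abstract does not specify the RGGN architecture, so the following is only a generic illustration of the idea it describes: word vectors exchange messages along typed dependency arcs and are then updated through a gate. Everything here is an assumption made for illustration, not the authors' model: the function name `propagate`, the scalar per-relation weights, the toy two-dimensional vectors standing in for BERT embeddings, and the simplified gate.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def propagate(embeddings, edges, relation_weight, steps=1):
    """Relation-aware gated message passing over a dependency graph.

    embeddings: list of word vectors (toy stand-ins for BERT embeddings).
    edges: (head_index, dependent_index, relation_label) triples.
    relation_weight: hypothetical scalar weight per relation label; a real
        model would more likely learn a matrix per relation.
    """
    states = [list(vec) for vec in embeddings]  # copy, do not mutate input
    dim = len(states[0])
    for _ in range(steps):
        messages = [[0.0] * dim for _ in states]
        for head, dep, rel in edges:
            w = relation_weight[rel]
            for k in range(dim):
                # Information flows both ways along each dependency arc.
                messages[dep][k] += w * states[head][k]
                messages[head][k] += w * states[dep][k]
        new_states = []
        for state, msg in zip(states, messages):
            updated = []
            for s, m in zip(state, msg):
                # Simplified element-wise gate: how much of the aggregated
                # message replaces the word's current value.
                g = sigmoid(s + m)
                updated.append((1.0 - g) * s + g * math.tanh(m))
            new_states.append(updated)
        states = new_states
    return states


# Toy sentence "She reads books": "reads" heads both other words.
toy_embeddings = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
toy_edges = [(1, 0, "nsubj"), (1, 2, "obj")]
toy_weights = {"nsubj": 1.0, "obj": 0.5}
enhanced = propagate(toy_embeddings, toy_edges, toy_weights)
```

After one step each word's vector mixes in information from its syntactic neighbours only; more steps let information travel further along the dependency tree.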
Pages: 5518-5522
Page count: 5