SegINR: Segment-Wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech

Cited by: 0
Authors
Kim, Minchan [1, 2]
Jeong, Myeonghun [1, 2]
Lee, Joun Yeop [3]
Kim, Nam Soo [1, 2]
Affiliations
[1] Seoul Natl Univ, Dept Elect & Comp Engn, Seoul 08826, South Korea
[2] Seoul Natl Univ, Inst New Media & Commun, Seoul 08826, South Korea
[3] Samsung Res, Seoul 06765, South Korea
Keywords
Semantics; Predictive models; Computational modeling; Transducers; Training; Indexes; Regulation; Linguistics; Computational efficiency; Implicit neural representation; sequence alignment; text-to-speech
DOI
10.1109/LSP.2025.3528858
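Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic and communication technology]
Discipline classification code
0808; 0809
Abstract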
We present SegINR, a novel approach to neural Text-to-Speech (TTS) that eliminates the need for either an auxiliary duration predictor or autoregressive (AR) sequence modeling for alignment. SegINR simplifies the TTS process by directly converting text sequences into frame-level features. Encoded text embeddings are transformed into segments of frame-level features with length regulation using a conditional implicit neural representation (INR). This method, termed Segment-wise INR (SegINR), captures temporal dynamics within each segment while autonomously defining segment boundaries, resulting in lower computational costs. Integrated into a two-stage TTS framework, SegINR is employed for semantic token prediction. Experiments in zero-shot adaptive TTS scenarios show that SegINR outperforms conventional methods in speech quality while remaining computationally efficient.
Pages: 646-650
Page count: 5
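The abstract describes a conditional INR that turns each text embedding into a segment of frame-level features and decides where the segment ends. The following is a minimal sketch of that general idea only, not the authors' implementation. The abstract does not specify the network, the time encoding, or the boundary criterion, so everything here is an assumption made for illustration: the tiny random-weight MLP, the sine/cosine time encoding, the extra "stop" output used to end a segment, and the names `make_inr`, `inr_step`, and `expand_segments`.

```python
import numpy as np


def make_inr(embed_dim, feat_dim, hidden=32, seed=0):
    """Build a toy conditional INR: an MLP with random (untrained) weights.

    Input  = text embedding + 2-dim time encoding.
    Output = frame feature (feat_dim) + 1 hypothetical stop logit.
    """
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0.0, 0.5, (embed_dim + 2, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.5, (hidden, feat_dim + 1)),
        "b2": np.zeros(feat_dim + 1),
    }


def inr_step(params, embedding, t):
    """Evaluate the INR at frame index t, conditioned on one text embedding."""
    time_code = [np.sin(0.1 * t), np.cos(0.1 * t)]  # assumed time encoding
    x = np.concatenate([embedding, time_code])
    h = np.tanh(x @ params["w1"] + params["b1"])
    out = h @ params["w2"] + params["b2"]
    return out[:-1], out[-1]  # (frame feature, stop logit)


def expand_segments(params, embeddings, max_len=8):
    """Expand each text embedding into a variable-length segment of frames.

    A segment ends when the assumed stop logit becomes positive or when
    max_len frames have been produced. No separate duration predictor and
    no dependence on previously generated frames is used.
    """
    frames, lengths = [], []
    for emb in embeddings:
        n = 0
        for t in range(max_len):
            feat, stop_logit = inr_step(params, emb, t)
            frames.append(feat)
            n += 1
            if stop_logit > 0:
                break
        lengths.append(n)
    return np.stack(frames), lengths


if __name__ == "__main__":
    # Four stand-in "text embeddings" of dimension 16 (random placeholders).
    text_embeddings = np.random.default_rng(1).normal(size=(4, 16))
    inr = make_inr(embed_dim=16, feat_dim=5)
    frames, lengths = expand_segments(inr, text_embeddings)
    print("frames per text token:", lengths)
    print("frame-level feature matrix shape:", frames.shape)
```

Because every frame is a function of only the text embedding and the frame index, the frames within a segment could in principle be evaluated in parallel; the sequential inner loop here is kept only for readability.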