SegINR: Segment-Wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech

Times Cited: 0
Authors
Kim, Minchan [1,2]
Jeong, Myeonghun [1,2]
Lee, Joun Yeop [3]
Kim, Nam Soo [1,2]
Affiliations
[1] Seoul Natl Univ, Dept Elect & Comp Engn, Seoul 08826, South Korea
[2] Seoul Natl Univ, Inst New Media & Commun, Seoul 08826, South Korea
[3] Samsung Res, Seoul 06765, South Korea
Keywords
Semantics; Predictive models; Computational modeling; Transducers; Training; Indexes; Regulation; Linguistics; Computational efficiency; Implicit neural representation; sequence alignment; text-to-speech
DOI
10.1109/LSP.2025.3528858
CLC Classification
TM [Electrical Engineering]; TN [Electronics & Communication Technology]
Discipline Codes
0808; 0809
Abstract
We present SegINR, a novel approach to neural Text-to-Speech (TTS) that eliminates the need for either an auxiliary duration predictor or autoregressive (AR) sequence modeling for alignment. SegINR simplifies the TTS process by directly converting text sequences into frame-level features. Encoded text embeddings are transformed into segments of frame-level features, with length regulation handled by a conditional implicit neural representation (INR). This method, termed Segment-wise INR (SegINR), captures temporal dynamics within each segment while autonomously defining segment boundaries, resulting in lower computational costs. Integrated into a two-stage TTS framework, SegINR is employed for semantic token prediction. Experiments in zero-shot adaptive TTS scenarios show that SegINR outperforms conventional methods in speech quality while remaining computationally efficient.
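To make the abstract's idea of a segment-wise conditional INR more concrete, below is a minimal PyTorch sketch, not the paper's implementation: it assumes each encoded text embedding conditions a small MLP that maps intra-segment frame indices to frame-level features, with an auxiliary per-frame boundary head standing in for the "autonomously defining segment boundaries" behavior. All names, dimensions, and the boundary head are illustrative assumptions.

import torch
import torch.nn as nn

class SegmentINR(nn.Module):
    # Maps intra-segment frame indices to frame-level features, conditioned on
    # one encoded text embedding (one segment per text token). Hypothetical sketch.
    def __init__(self, cond_dim=256, hidden_dim=256, feat_dim=80, max_frames=64):
        super().__init__()
        self.index_emb = nn.Embedding(max_frames, hidden_dim)  # frame-index coordinate
        self.cond_proj = nn.Linear(cond_dim, hidden_dim)       # text-embedding condition
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
        )
        # Assumed boundary head: a per-frame end-of-segment score would let the model
        # decide segment lengths itself instead of relying on a duration predictor.
        self.eos_head = nn.Linear(hidden_dim, 1)

    def forward(self, text_emb, num_frames):
        # text_emb: (batch, cond_dim); num_frames: frames to decode for this segment
        idx = torch.arange(num_frames, device=text_emb.device)
        h = self.index_emb(idx).unsqueeze(0) + self.cond_proj(text_emb).unsqueeze(1)
        feats = self.mlp(h)                          # (batch, num_frames, feat_dim)
        eos_logits = self.eos_head(h).squeeze(-1)    # (batch, num_frames)
        return feats, eos_logits

At inference, such a model would decode frames for one text embedding until the boundary score crosses a threshold, then move to the next segment. Per the abstract, the actual system predicts semantic tokens within a two-stage TTS framework, so the output head and feature dimension would differ from this sketch.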
Pages: 646-650
Page Count: 5
Related Papers
50 records in total
  • [31] Controlling formant frequencies with neural text-to-speech for the manipulation of perceived speaker age
    Khan, Ziya
    Wihlborg, Lovisa
    Valentini-Botinhao, Cassia
    Watts, Oliver
    INTERSPEECH 2023, 2023, : 4359 - 4363
  • [32] SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis
    Maniati, Georgia
    Vioni, Alexandra
    Ellinas, Nikolaos
    Nikitaras, Karolos
    Klapsas, Konstantinos
    Sung, June Sig
    Jho, Gunu
    Chalamandaris, Aimilios
    Tsiakoulis, Pirros
    INTERSPEECH 2022, 2022, : 2388 - 2392
  • [33] Japanese Neural Incremental Text-to-Speech Synthesis Framework With an Accent Phrase Input
    Yanagita, Tomoya
    Sakti, Sakriani
    Nakamura, Satoshi
    IEEE ACCESS, 2023, 11 : 22355 - 22363
  • [34] A NEURAL TEXT-TO-SPEECH MODEL UTILIZING BROADCAST DATA MIXED WITH BACKGROUND MUSIC
    Bae, Hanbin
    Bae, Jae-Sung
    Joo, Young-Sun
    Kim, Young-Ik
    Cho, Hoon-Young
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6603 - 6607
  • [35] A STUDY ON NEURAL-NETWORK-BASED TEXT-TO-SPEECH ADAPTATION TECHNIQUES FOR VIETNAMESE
    Pham Ngoc Phuong
    Chung Tran Quang
    Quoc Truong Do
    Mai Chi Luong
    2021 24TH CONFERENCE OF THE ORIENTAL COCOSDA INTERNATIONAL COMMITTEE FOR THE CO-ORDINATION AND STANDARDISATION OF SPEECH DATABASES AND ASSESSMENT TECHNIQUES (O-COCOSDA), 2021, : 199 - 205
  • [36] Text to Phoneme Alignment and Mapping for Speech Technology: A Neural Networks Approach
    Bullinaria, John A.
    2011 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2011, : 625 - 632
  • [37] Predication of prosodic data in Persian text-to-speech systems using recurrent neural network
    Farrokhi, A
    Ghaemmaghami, S
    ELECTRONICS LETTERS, 2003, 39 (25) : 1868 - 1869
  • [38] Optimisation of artificial neural network topology applied in the prosody control in text-to-speech synthesis
    Sebesta, V
    Tucková, J
    SOFSEM 2000: THEORY AND PRACTICE OF INFORMATICS, 2000, 1963 : 420 - 430
  • [39] EFFECT OF CHOICE OF PROBABILITY DISTRIBUTION, RANDOMNESS, AND SEARCH METHODS FOR ALIGNMENT MODELING IN SEQUENCE-TO-SEQUENCE TEXT-TO-SPEECH SYNTHESIS USING HARD ALIGNMENT
    Yasuda, Yusuke
    Wang, Xin
    Yamagishi, Junichi
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6724 - 6728
  • [40] NEURAL-NETWORK-BASED F0 TEXT-TO-SPEECH SYNTHESIZER FOR MANDARINE
    HWANG, SH
    CHEN, SH
    IEE PROCEEDINGS-VISION IMAGE AND SIGNAL PROCESSING, 1994, 141 (06): : 384 - 390