SegINR: Segment-Wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech

Cited: 0
Authors
Kim, Minchan [1 ,2 ]
Jeong, Myeonghun [1 ,2 ]
Lee, Joun Yeop [3 ]
Kim, Nam Soo [1 ,2 ]
Affiliations
[1] Seoul Natl Univ, Dept Elect & Comp Engn, Seoul 08826, South Korea
[2] Seoul Natl Univ, Inst New Media & Commun, Seoul 08826, South Korea
[3] Samsung Res, Seoul 06765, South Korea
Keywords
Semantics; Predictive models; Computational modeling; Transducers; Training; Indexes; Regulation; Linguistics; Computational efficiency; Implicit neural representation; sequence alignment; text-to-speech
DOI
10.1109/LSP.2025.3528858
CLC Classification
TM [Electrical Technology]; TN [Electronic Technology & Communication Technology]
Discipline Codes
0808; 0809
Abstract
We present SegINR, a novel approach to neural Text-to-Speech (TTS) that eliminates the need for either an auxiliary duration predictor or autoregressive (AR) sequence modeling for alignment. SegINR simplifies the TTS process by directly converting text sequences into frame-level features. Encoded text embeddings are transformed into segments of frame-level features, with length regulation handled by a conditional implicit neural representation (INR). This method, termed Segment-wise INR (SegINR), captures temporal dynamics within each segment while autonomously defining segment boundaries, resulting in lower computational costs. Integrated into a two-stage TTS framework, SegINR is employed for semantic token prediction. Experiments in zero-shot adaptive TTS scenarios show that SegINR outperforms conventional methods in speech quality while remaining computationally efficient.
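To make the mechanism described in the abstract concrete, the following is a minimal, hypothetical sketch (Python/PyTorch) of a segment-wise conditional INR: a single encoded text embedding conditions an MLP that maps each in-segment frame index t to a frame-level feature, while a stop head decides where the segment ends. All module names, dimensions, and the 0.5 stop threshold are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class SegmentINR(nn.Module):
    """Hypothetical sketch: one text embedding -> one variable-length segment."""
    def __init__(self, emb_dim=256, feat_dim=80, hidden=256, max_frames=64):
        super().__init__()
        self.max_frames = max_frames
        # Conditional INR body: input is [text embedding ; frame index].
        self.net = nn.Sequential(
            nn.Linear(emb_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.feat_head = nn.Linear(hidden, feat_dim)  # frame-level feature
        self.stop_head = nn.Linear(hidden, 1)         # segment-boundary logit

    def forward(self, emb):
        # emb: (emb_dim,) embedding of one text unit. Generate frames until
        # the stop head fires (autonomous boundary) or max_frames is hit.
        frames = []
        for t in range(self.max_frames):
            x = torch.cat([emb, torch.tensor([float(t)])])
            h = self.net(x)
            frames.append(self.feat_head(h))
            if torch.sigmoid(self.stop_head(h)) > 0.5:  # assumed threshold
                break
        return torch.stack(frames)  # (segment_length, feat_dim)

# Usage: decode each text embedding into its own segment, then concatenate,
# yielding frame-level features without a separate duration predictor.
model = SegmentINR()
text_embs = torch.randn(5, 256)  # 5 encoded text units (dummy data)
features = torch.cat([model(e) for e in text_embs], dim=0)
print(features.shape)  # (total_frames, 80)

Since an INR evaluates each frame index independently, all indices of a segment could in principle be queried in one batched forward pass; the Python loop above is sequential only for clarity, with the boundary decision being the only serial step.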
Pages: 646-650
Page count: 5
Related Papers
50 records in total
  • [41] A Study on the Efficacy of Model Pre-Training in Developing Neural Text-to-Speech System
    Zhang, Guangyan
    Leng, Yichong
    Tan, Daxin
    Qin, Ying
    Song, Kaitao
    Tan, Xu
    Zhao, Sheng
    Lee, Tan
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022: 6087-6091
  • [42] Intra-Sentential Speaking Rate Control in Neural Text-To-Speech for Automatic Dubbing
    Sharma, Mayank
    Virkar, Yogesh
    Federico, Marcello
    Barra-Chicote, Roberto
    Enyedi, Robert
    INTERSPEECH 2021, 2021: 3151-3155
  • [43] Intonation Control for Neural Text-to-Speech Synthesis with Polynomial Models of F0
    Corkey, Niamh
    O'Mahony, Johannah
    King, Simon
    INTERSPEECH 2023, 2023: 2014-2015
  • [44] Prosody modeling for syllable based text-to-speech synthesis using feedforward neural networks
    Reddy, V. Ramu
    Rao, K. Sreenivasa
    NEUROCOMPUTING, 2016, 171: 1323-1334
  • [45] SpikeVoice: High-Quality Text-to-Speech Via Efficient Spiking Neural Network
    Wang, Kexin
    Zhang, Jiahong
    Ren, Yong
    Yao, Man
    Shang, Di
    Xu, Bo
    Li, Guoqi
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024: 7927-7940
  • [46] Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora
    Luong, Hieu-Thi
    Wang, Xin
    Yamagishi, Junichi
    Nishizawa, Nobuyuki
    INTERSPEECH 2019, 2019: 1303-1307
  • [47] Speech Enhancement for a Noise-Robust Text-to-Speech Synthesis System using Deep Recurrent Neural Networks
    Valentini-Botinhao, Cassia
    Wang, Xin
    Takaki, Shinji
    Yamagishi, Junichi
    17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016: 352-356
  • [48] Syntactic analysis and letter-to-phoneme conversion using neural networks - an application of neural networks to an English text-to-speech system
    Yamaguchi, Yukiko
    Matsumoto, Tatsuro
    Systems and Computers in Japan, 1993, 24(8): 71-81
  • [49] A Cyclical Post-filtering Approach to Mismatch Refinement of Neural Vocoder for Text-to-speech Systems
    Wu, Yi-Chiao
    Tobing, Patrick Lumban
    Yasuhara, Kazuki
    Matsunaga, Noriyuki
    Ohtani, Yamato
    Toda, Tomoki
    INTERSPEECH 2020, 2020: 3540-3544
  • [50] Articulatory Text-to-Speech Synthesis using the Digital Waveguide Mesh driven by a Deep Neural Network
    Gully, Amelia J.
    Yoshimura, Takenori
    Murphy, Damian T.
    Hashimoto, Kei
    Nankaku, Yoshihiko
    Tokuda, Keiichi
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017: 234-238