Real-Time Full-Band Voice Conversion with Sub-Band Modeling and Data-Driven Phase Estimation of Spectral Differentials

Authors
Saeki, Takaaki [1]
Saito, Yuki [1]
Takamichi, Shinnosuke [1]
Saruwatari, Hiroshi [1]
Affiliations
[1] Univ Tokyo, Tokyo 1138656, Japan
Keywords
voice conversion; spectral differentials; deep neural networks; data-driven phase; sub-band modeling; SPEECH SYNTHESIS; REPRESENTATIONS; SYSTEM;
DOI
10.1587/transinf.2020EDP7252
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This paper proposes two high-fidelity and computationally efficient neural voice conversion (VC) methods based on direct waveform modification using spectral differentials. The conventional spectral-differential VC method with a minimum-phase filter achieves high-quality conversion for narrow-band (16 kHz-sampled) VC but incurs a heavy computational cost in filtering. This is because the minimum phase obtained using a fixed lifter of the Hilbert transform often results in a long-tap filter. Furthermore, when the method is extended to full-band (48 kHz-sampled) VC, the computational cost is heavy due to the increased number of sampling points, and the converted-speech quality degrades due to large fluctuations in the high-frequency band. To construct a short-tap filter, we propose a lifter-training method for data-driven phase reconstruction that trains a lifter of the Hilbert transform by taking filter truncation into account. We also propose a frequency-band-wise modeling method based on sub-band multi-rate signal processing (sub-band modeling method) for full-band VC. It enhances computational efficiency by reducing the number of sampling points of the signals converted with filtering, and it improves converted-speech quality by modeling only the low-frequency band. We conducted several objective and subjective evaluations to investigate the effectiveness of the proposed methods through an implementation of the real-time, online, full-band VC system we developed, which is based on the proposed methods. The results indicate that 1) the proposed lifter-training method for narrow-band VC can shorten the tap length to 1/16 without degrading the converted-speech quality, 2) the proposed sub-band modeling method for full-band VC can improve the converted-speech quality while reducing the computational cost, and 3) our real-time, online, full-band VC system can convert 48 kHz-sampled speech in real time, attaining a mean opinion score of 3.6 out of 5.0 for naturalness.
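The fixed-lifter baseline that the abstract contrasts against can be illustrated with a minimal sketch. It assumes the textbook cepstral construction of a minimum-phase filter (fold the real cepstrum with a fixed lifter, then exponentiate) followed by plain truncation to a given tap length. The function names `fixed_minimum_phase_lifter` and `minimum_phase_filter` and the `lifter` argument are hypothetical and are not taken from the paper. The proposed method would replace the fixed lifter with trained values; that training is not shown here.

```python
import numpy as np


def fixed_minimum_phase_lifter(n_fft):
    """Fixed folding lifter: 1 at quefrency 0 and n_fft/2, 2 in between, 0 elsewhere."""
    if n_fft % 2 != 0:
        raise ValueError("n_fft must be even")
    lifter = np.zeros(n_fft)
    lifter[0] = 1.0
    lifter[1:n_fft // 2] = 2.0
    lifter[n_fft // 2] = 1.0
    return lifter


def minimum_phase_filter(log_magnitude, n_taps, lifter=None):
    """Build a minimum-phase impulse response from a log-magnitude spectrum.

    log_magnitude: real, symmetric log-amplitude spectrum of length n_fft
                   (as produced by a full-length FFT of a real signal).
    n_taps:        number of taps kept after truncation.
    lifter:        optional quefrency-domain weights (hypothetical hook where
                   trained lifter values could be supplied); defaults to the
                   fixed minimum-phase lifter.
    """
    n_fft = len(log_magnitude)
    if lifter is None:
        lifter = fixed_minimum_phase_lifter(n_fft)
    # Real cepstrum of the magnitude spectrum.
    cepstrum = np.fft.ifft(log_magnitude).real
    # Liftering makes the cepstrum causal, which yields a minimum-phase spectrum.
    min_phase_spectrum = np.exp(np.fft.fft(cepstrum * lifter))
    impulse_response = np.fft.ifft(min_phase_spectrum).real
    # Plain truncation; a long decaying tail is what makes short filters lossy.
    return impulse_response[:n_taps]
```

With the fixed lifter, the untruncated response reproduces the input magnitude spectrum, and any loss comes from the truncation step alone. This is the step the abstract says the lifter training takes into account.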
Pages: 1002-1016
Number of pages: 15