Non-Parallel Voice Conversion System With WaveNet Vocoder and Collapsed Speech Suppression

Cited by: 5
Authors
Wu, Yi-Chiao [1]
Tobing, Patrick Lumban [1]
Kobayashi, Kazuhiro [2]
Hayashi, Tomoki [3]
Toda, Tomoki [2]
Affiliations
[1] Nagoya Univ, Grad Sch Informat, Nagoya, Aichi 4648601, Japan
[2] Nagoya Univ, Informat Technol Ctr, Nagoya, Aichi 4648601, Japan
[3] Nagoya Univ, Grad Sch Informat Sci, Nagoya, Aichi 4648601, Japan
Source
IEEE Access | 2020 / Vol. 8
Funding
Japan Society for the Promotion of Science (JSPS); Japan Science and Technology Agency (JST)
Keywords
Non-parallel voice conversion; WaveNet vocoder; collapsed speech segment detection; linear predictive coding distribution constraint; NEURAL-NETWORKS; REPRESENTATIONS; GENERATION;
DOI
10.1109/ACCESS.2020.2984007
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In this paper, we integrate a simple non-parallel voice conversion (VC) system with a WaveNet (WN) vocoder and a proposed collapsed speech suppression technique. The effectiveness of WN as a vocoder for generating high-fidelity speech waveforms from acoustic features has been confirmed in recent works. However, when the WN vocoder is combined with a VC system, distorted acoustic features, acoustic and temporal mismatches, and exposure bias usually cause significant speech quality degradation, leading WN to generate very noisy speech segments called collapsed speech. To tackle this problem, we use speech generated by a conventional vocoder as reference speech and derive from it a linear predictive coding distribution constraint (LPCDC) that helps avoid collapsed speech. Furthermore, to mitigate the negative effects introduced by the LPCDC, we propose a collapsed speech segment detector (CSSD) that ensures the LPCDC is applied only to the problematic segments, limiting the loss of quality to short periods. Objective and subjective evaluations are conducted, and the experimental results confirm the effectiveness of the proposed method, which further improves the speech quality of our previous non-parallel VC system submitted to the Voice Conversion Challenge 2018.
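The abstract relies on linear predictive coding (LPC) without spelling it out. As a rough illustration only, here is a minimal sketch of standard LPC analysis (autocorrelation method with the Levinson-Durbin recursion) plus one-step sample prediction from a reference segment. It is not the authors' LPCDC: how the paper turns LPC from conventional-vocoder speech into a constraint on the WaveNet output distribution is not described in this record, and the helper names `autocorrelation`, `levinson_durbin`, and `lpc_predict` are hypothetical.

```python
"""Standard LPC analysis sketch (hypothetical helper names; NOT the paper's LPCDC)."""


def autocorrelation(frame: list[float], max_lag: int) -> list[float]:
    """Biased autocorrelation r[0..max_lag] of one analysis frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r: list[float], order: int) -> tuple[list[float], float]:
    """Return predictor coefficients a[1..p] and the final prediction-error energy.

    Convention assumed here: x_hat[n] = sum_{k=1}^{p} a[k] * x[n - k].
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    if err <= 0.0:
        return a[1:], 0.0  # silent frame: no meaningful predictor
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # signal already perfectly predicted; stop the recursion
        # Reflection coefficient for stage i.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err


def lpc_predict(history: list[float], coeffs: list[float]) -> float:
    """One-step prediction; history[-1] is x[n-1]. Needs len(history) >= len(coeffs)."""
    return sum(c * history[-1 - k] for k, c in enumerate(coeffs))


if __name__ == "__main__":
    # Toy decaying AR(1)-like signal standing in for a reference segment (not speech).
    signal = [0.9 ** n for n in range(200)]
    coeffs, residual = levinson_durbin(autocorrelation(signal, 2), 2)
    print("LPC coefficients:", coeffs, "residual energy:", residual)
    print("prediction of next sample:", lpc_predict(signal[:10], coeffs))
```

On this toy signal the first coefficient should come out close to 0.9 and the second close to 0. Treating such a prediction as, for example, the center of a distribution over the next waveform sample is only one plausible reading of "LPC distribution constraint" and is an assumption, not the method confirmed by this record.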
Pages: 62094-62106
Number of pages: 13