Duration Controllable Voice Conversion via Phoneme-Based Information Bottleneck

Citations: 0
Authors
Lee, Sang-Hoon [1]
Noh, Hyeong-Rae [1]
Nam, Woo-Jeoung [2]
Lee, Seong-Whan [3]
Affiliations
[1] Korea Univ, Dept Brain & Cognit Engn, Seoul 02841, South Korea
[2] Korea Univ, Dept Comp & Radio Commun Engn, Seoul 02841, South Korea
[3] Korea Univ, Dept Artificial Intelligence, Seoul 02841, South Korea
Keywords
Speech processing; Decoding; Training; Generative adversarial networks; Speech; Timbre; Licenses; Information bottleneck; non-autoregressive model; voice style transfer; voice conversion; SPEECH SYNTHESIS; SPEAKER; TRANSFORMER; NETWORKS;
DOI
10.1109/TASLP.2022.3156757
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Several voice conversion (VC) methods using a simple autoencoder with a carefully designed information bottleneck have recently been studied. In general, they extract content information from a given speech through the information bottleneck between the encoder and the decoder, providing it to the decoder along with the target speaker information to generate the converted speech. However, their performance is highly dependent on the downsampling factor of an information bottleneck. In addition, such frame-by-frame conversion methods cannot convert speaking styles associated with the length of utterance, such as the duration. In this paper, we propose a novel duration controllable voice conversion (DCVC) model, which can transfer the speaking style and control the speed of the converted speech through a phoneme-based information bottleneck. The proposed information bottleneck does not need to find an appropriate downsampling factor, achieving a better audio quality and VC performance. In our experiments, DCVC outperformed the baseline models with a 3.78 MOS and a 3.83 similarity score. It can also smoothly control the speech duration while achieving a 39.35x speedup compared with a Seq2seq-based VC in terms of the inference speed.
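The abstract does not describe the model's internals, so the following is only a minimal sketch of the general idea of a phoneme-level bottleneck with duration control, not the authors' implementation. It assumes that frame-level features are averaged within each phoneme segment and then re-expanded by scaled durations, in the manner of a length regulator from non-autoregressive speech synthesis. The function names (`phoneme_bottleneck`, `scale_durations`, `length_regulate`), the averaging step, and the `speed` parameter are all hypothetical.

```python
from typing import List, Sequence

Frame = List[float]


def phoneme_bottleneck(frames: Sequence[Frame], durations: Sequence[int]) -> List[Frame]:
    """Average frame-level features inside each phoneme segment (assumed pooling)."""
    if any(d < 1 for d in durations) or sum(durations) != len(frames):
        raise ValueError("durations must be >= 1 and sum to the number of frames")
    pooled: List[Frame] = []
    start = 0
    for d in durations:
        segment = frames[start:start + d]
        dim = len(segment[0])
        # One vector per phoneme: the mean over that phoneme's frames.
        pooled.append([sum(f[i] for f in segment) / d for i in range(dim)])
        start += d
    return pooled


def scale_durations(durations: Sequence[int], speed: float) -> List[int]:
    """Rescale durations; speed > 1 shortens speech, speed < 1 lengthens it."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    # Keep at least one frame per phoneme so no phoneme disappears.
    return [max(1, round(d / speed)) for d in durations]


def length_regulate(phoneme_feats: Sequence[Frame], durations: Sequence[int]) -> List[Frame]:
    """Repeat each phoneme-level vector for its (possibly rescaled) duration."""
    out: List[Frame] = []
    for feat, d in zip(phoneme_feats, durations):
        out.extend(list(feat) for _ in range(d))
    return out


# Toy input: six 1-dimensional frames covering three phonemes.
frames = [[1.0], [3.0], [2.0], [2.0], [2.0], [6.0]]
durations = [2, 3, 1]

pooled = phoneme_bottleneck(frames, durations)
slow = length_regulate(pooled, scale_durations(durations, speed=0.5))
fast = length_regulate(pooled, scale_durations(durations, speed=2.0))
print(len(frames), len(slow), len(fast))
```

Because the pooled sequence has one vector per phoneme, its length depends on the phoneme segmentation and not on a fixed downsampling factor, which illustrates the property the abstract highlights. In a real system the decoder would also receive target speaker information, which this sketch omits.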
Pages: 1173-1183
Number of pages: 11