Non-Parallel Training in Voice Conversion Using an Adaptive Restricted Boltzmann Machine

Cited by: 51

Authors:

Nakashika, Toru [1]

Takiguchi, Tetsuya [2]

Minami, Yasuhiro [1]

Affiliations:

[1] Univ Electrocommun, Grad Sch Informat Syst, Tokyo 1828585, Japan

[2] Kobe Univ, Org Adv Sci & Technol, Kobe, Hyogo 6578501, Japan

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2016 / Vol. 24 / Issue 11

Keywords:

Restricted Boltzmann machine; speaker adaptation; unsupervised training; voice conversion; NEURAL-NETWORKS; TRANSFORMATION; SPARSE;

DOI:

10.1109/TASLP.2016.2593263

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

In this paper, we present a voice conversion (VC) method that does not use any parallel data while training the model. VC is a technique where only speaker-specific information in source speech is converted while keeping the phonological information unchanged. Most of the existing VC methods rely on parallel data, that is, pairs of speech data from the source and target speakers uttering the same sentences. However, the use of parallel data in training causes several problems: 1) the data used for the training are limited to the predefined sentences, 2) the trained model can be applied only to the speaker pair used in the training, and 3) mismatches in alignment may occur. Although it is therefore preferable in VC not to use parallel data, a nonparallel approach is considered difficult to train. In our approach, we achieve nonparallel training based on a speaker adaptation technique and on capturing latent phonological information. This approach assumes that speech signals are produced from a restricted Boltzmann machine-based probabilistic model, where phonological information and speaker-related information are defined explicitly. Speaker-independent and speaker-dependent parameters are simultaneously trained under speaker adaptive training. In the conversion stage, a given speech signal is decomposed into phonological and speaker-related information, the speaker-related information is replaced with that of the desired speaker, and then voice-converted speech is obtained by mixing the two. Our experimental results showed that our approach outperformed another nonparallel approach, and produced results similar to those of the popular conventional Gaussian mixture model-based method that used parallel data, under both subjective and objective criteria.

Pages: 2032-2045

Number of pages: 14

