Any-to-any voice conversion using representation separation auto-encoder

Cited by: 0
Authors
Jian Z. [1 ]
Zhang Z. [1 ]
Affiliations
[1] School of Communication Engineering, Hangzhou Dianzi University, Hangzhou
Source
Journal on Communications
Funding
National Natural Science Foundation of China
Keywords
adaptive instance normalization; representation separation; self-content loss; self-speaker loss; voice conversion;
DOI
10.11959/j.issn.1000-436x.2024044
Abstract
To address the difficulty of separating speaker personality characteristics from semantic content in any-to-any voice conversion trained on non-parallel corpora, which leads to unsatisfactory conversion performance, a voice conversion method called RSAE-VC (representation separation auto-encoder voice conversion) was proposed. The speaker's personality characteristics in speech were treated as time-invariant and the content information as time-variant, and instance normalization together with an activation guidance layer was used in the encoder to separate the two. The decoder then synthesized the converted speech from the content information of the source speech and the personality characteristics of the target speech. Experimental results demonstrate that, compared with the AGAIN-VC (activation guidance and adaptive instance normalization voice conversion) method, RSAE-VC reduces mel-cepstral distortion by 3.11% and the root mean square error of pitch frequency by 2.41% on average, and improves the MOS by 5.22% and the ABX score by 8.45%. In RSAE-VC, a self-content loss makes the converted speech retain more content information, and a self-speaker loss separates the speaker personality characteristics from the speech more thoroughly, so that as little speaker information as possible is left in the content representation and the conversion performance is improved. © 2024 Editorial Board of Journal on Communications. All rights reserved.
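The abstract names three mechanisms: instance normalization (plus an activation guidance layer) in the encoder to strip time-invariant speaker statistics from the content path, AdaIN in the decoder to re-inject the target speaker's statistics, and the self-content/self-speaker losses computed by re-encoding the converted speech. The PyTorch sketch below illustrates how these pieces could fit together; the convolutional stacks, hidden width, mel dimension of 80, and equal loss weighting are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch of IN-based speaker/content separation with AdaIN decoding
# and the two "self" losses described in the abstract. Layer sizes and loss
# weights are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


def speaker_stats(x, eps=1e-5):
    """Per-channel mean/std over time: a time-invariant 'speaker' summary."""
    mu = x.mean(dim=2, keepdim=True)                  # (B, C, 1)
    sigma = (x.var(dim=2, keepdim=True) + eps).sqrt()
    return mu, sigma


class ContentEncoder(nn.Module):
    """Conv stack + instance norm: normalizing each channel over time removes
    speaker statistics; a sigmoid 'activation guidance' bounds the content code."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 5, padding=2),
        )
        self.norm = nn.InstanceNorm1d(hidden, affine=False)

    def forward(self, mel):                           # mel: (B, n_mels, T)
        return torch.sigmoid(self.norm(self.conv(mel)))


class SpeakerEncoder(nn.Module):
    """Returns the time-invariant statistics that AdaIN will re-inject."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 5, padding=2),
        )

    def forward(self, mel):
        return speaker_stats(self.conv(mel))          # (mu, sigma)


class Decoder(nn.Module):
    """AdaIN: scale/shift the normalized content code with the target
    speaker's statistics, then project back to a mel-spectrogram."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.out = nn.Sequential(
            nn.Conv1d(hidden, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, n_mels, 5, padding=2),
        )

    def forward(self, content, spk):
        mu, sigma = spk
        return self.out(content * sigma + mu)         # AdaIN, then projection


def conversion_losses(ce, se, dec, src_mel, tgt_mel):
    """Reconstruction plus the two 'self' losses: re-encode the converted
    speech and require (a) its content to match the source content
    (self-content loss) and (b) its speaker statistics to match the target's
    (self-speaker loss). Equal weights are assumed here."""
    c_src = ce(src_mel)
    s_tgt = se(tgt_mel)
    converted = dec(c_src, s_tgt)

    recon = F.l1_loss(dec(c_src, se(src_mel)), src_mel)   # autoencoding path
    l_content = F.l1_loss(ce(converted), c_src)           # self-content loss
    mu_c, sig_c = se(converted)
    l_speaker = F.l1_loss(mu_c, s_tgt[0]) + F.l1_loss(sig_c, s_tgt[1])
    return recon + l_content + l_speaker
```

The intuition behind the two self losses is circular consistency: if the converted speech is pushed back through the same encoders, any speaker information leaking into the content code shows up as a mismatch with the source content, and any content leaking into the speaker statistics shows up as a mismatch with the target's statistics, so both penalties drive the separation the encoder is supposed to achieve.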
Pages: 162-172
Page count: 10