Many-to-Many Voice Transformer Network

Cited by: 20
Authors
Kameoka, Hirokazu [1 ]
Huang, Wen-Chin [2 ]
Tanaka, Kou [1 ]
Kaneko, Takuhiro [1 ]
Hojo, Nobukatsu [1 ]
Toda, Tomoki [2 ]
Affiliations
[1] NTT Corp, NTT Commun Sci Labs, Atsugi, Kanagawa 2430198, Japan
[2] Nagoya Univ, Nagoya, Aichi 4648601, Japan
Keywords
Training; Acoustics; Computational modeling; Decoding; Data models; Training data; Computer architecture; Attention; many-to-many VC; sequence-to-sequence learning; voice conversion (VC); transformer network; CONVOLUTIONAL SEQUENCE; CONVERSION; SPEECH;
DOI
10.1109/TASLP.2020.3047262
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework, which enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech. We previously proposed an S2S-based VC method using a transformer network architecture called the voice transformer network (VTN). The original VTN was designed to learn only a mapping of speech feature sequences from one speaker to another. Here, the main idea we propose is an extension of the original VTN that can simultaneously learn mappings among multiple speakers. This extension, called the many-to-many VTN, enables us to fully use available training data collected from multiple speakers by capturing common latent features that can be shared across different speakers. It also allows us to introduce a training loss called the identity mapping loss to ensure that the input feature sequence will remain unchanged when the source and target speaker indices are the same. Using this particular loss for model training has been found to be extremely effective in improving the performance of the model at test time. We conducted speaker identity conversion experiments and found that our model obtained higher sound quality and speaker similarity than baseline methods. We also found that our model, with a slight modification to its architecture, can handle any-to-many conversion tasks reasonably well.
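The identity mapping loss is the abstract's most concrete technical device, so a compact illustration may help. The following is a minimal PyTorch sketch under stated assumptions: ToyM2MConverter, its layer sizes, and identity_mapping_loss are hypothetical names introduced here, and the toy encoder/decoder merely stands in for the paper's transformer architecture; only the loss term itself mirrors the abstract's description (the output should reproduce the input when source and target speaker indices coincide).

```python
# Minimal sketch of the identity mapping loss described in the abstract.
# ToyM2MConverter and its layers are illustrative assumptions standing in
# for the many-to-many VTN; they are not the authors' transformer model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyM2MConverter(nn.Module):
    """Stand-in many-to-many model: maps a source feature sequence to a
    target sequence, conditioned on source and target speaker indices."""
    def __init__(self, n_speakers: int, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, hidden)  # shared speaker codes
        self.enc = nn.Linear(feat_dim, hidden)           # toy "encoder"
        self.dec = nn.Linear(hidden, feat_dim)           # toy "decoder"

    def forward(self, feats, src_spk, tgt_spk):
        # feats: (batch, frames, feat_dim)
        h = self.enc(feats) + self.spk_emb(src_spk)[:, None, :]
        h = torch.tanh(h + self.spk_emb(tgt_spk)[:, None, :])
        return self.dec(h)

def identity_mapping_loss(model, feats, spk):
    """When source and target speaker indices are the same, the converted
    sequence should reproduce the input; an L1 penalty enforces this."""
    reconstructed = model(feats, src_spk=spk, tgt_spk=spk)
    return F.l1_loss(reconstructed, feats)

model = ToyM2MConverter(n_speakers=4, feat_dim=80)
feats = torch.randn(2, 100, 80)   # e.g. two 100-frame, 80-dim mel sequences
spk = torch.tensor([0, 3])        # each utterance's own speaker index
loss = identity_mapping_loss(model, feats, spk)
loss.backward()                   # the term is differentiable end to end
```

In training, this term would be added to the ordinary conversion loss computed on utterance pairs from different speakers; per the abstract, including it was found to be highly effective at improving performance at test time.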
Pages: 656-670
Page count: 15
Related Papers (50 in total)
  • [1] Many-to-many eigenvoice conversion with reference voice
    Ohtani, Yamato
    Toda, Tomoki
    Saruwatari, Hiroshi
    Shikano, Kiyohiro
    INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009: 1591-1594
  • [2] Accent and Speaker Disentanglement in Many-to-many Voice Conversion
    Wang, Zhichao
    Ge, Wenshuo
    Wang, Xiong
    Yang, Shan
    Gan, Wendong
    Chen, Haitao
    Li, Hai
    Xie, Lei
    Li, Xiulin
    2021 12TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2021
  • [3] Diverse style oriented many-to-many emotional voice conversion
    Zhou, Jian
    Luo, Xiangyu
    Wang, Huabin
    Zheng, Wenming
    Tao, Liang
    Shengxue Xuebao/Acta Acustica, 2024, 49 (06): 1297-1303
  • [4] Many-to-many voice conversion with sentence embedding based on VAACGAN
    Li, Yanping
    Cao, Pan
    Shi, Yang
    Zhang, Yan
    Beijing Hangkong Hangtian Daxue Xuebao/Journal of Beijing University of Aeronautics and Astronautics, 2021, 47 (03): 500-508
  • [5] Many-to-many Cross-lingual Voice Conversion with a Jointly Trained Speaker Embedding Network
    Zhou, Yi
    Tian, Xiaohai
    Das, Rohan Kumar
    Li, Haizhou
    2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019: 1282-1287
  • [6] Distributed Many-to-Many Mapping Algorithm in the Hypercube Network
    Han, Seung Chul
    NCM 2008: 4TH INTERNATIONAL CONFERENCE ON NETWORKED COMPUTING AND ADVANCED INFORMATION MANAGEMENT, VOL 2, PROCEEDINGS, 2008: 406-411
  • [7] Many-to-many voice conversion experiments using a Korean speech corpus
    Yook, Dongsuk
    Seo, HyungJin
    Ko, Bonggu
    Yoo, In-Chul
    JOURNAL OF THE ACOUSTICAL SOCIETY OF KOREA, 2022, 41 (03): 351-358
  • [8] Evaluation of a Singing Voice Conversion Method Based on Many-to-Many Eigenvoice Conversion
    Doi, Hironori
    Toda, Tomoki
    Nakano, Tomoyasu
    Goto, Masataka
    Nakamura, Satoshi
    14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013: 1066-1070
  • [9] Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder
    Luong, Manh
    Tran, Viet Anh
    INTERSPEECH 2021, 2021: 851-855
  • [10] Non-parallel Many-to-many Voice Conversion with PSR-StarGAN
    Li, Yanping
    Xu, Dongxiang
    Zhang, Yan
    Wang, Yang
    Chen, Binbin
    INTERSPEECH 2020, 2020: 781-785