Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations

Cited by: 1
Authors
Benaroya, Laurent [1]
Obin, Nicolas [1]
Roebel, Axel [1]
Affiliations
[1] Sorbonne Univ, Anal Synth Team, STMS, IRCAM, CNRS, French Minist Culture, F-75004 Paris, France
Keywords
voice conversion; attribute manipulation; representation learning; information disentanglement; adversarial learning; cross-entropy; CONVERSION
DOI
10.3390/e25020375
Chinese Library Classification
O4 [Physics]
Discipline classification code
0702
Abstract
Voice conversion (VC) consists of digitally altering the voice of an individual to manipulate part of its content, primarily its identity, while leaving the rest unchanged. Research in neural VC has achieved considerable breakthroughs, with the capacity to falsify a voice identity from a small amount of data with highly realistic rendering. This paper goes beyond voice identity manipulation and presents an original neural architecture that allows the manipulation of voice attributes (e.g., gender and age). The proposed architecture is inspired by the fader network and transfers the same ideas to voice manipulation. The information conveyed by the speech signal is disentangled into interpretable voice attributes by minimizing an adversarial loss, which makes the encoded information mutually independent while preserving the capacity to generate a speech signal from the disentangled codes. During inference for voice conversion, the disentangled voice attributes can be manipulated and the speech signal generated accordingly. For experimental evaluation, the proposed method is applied to voice gender conversion using the freely available VCTK dataset. Quantitative measurements of the mutual information between speaker identity and speaker gender show that the proposed architecture can learn a gender-independent representation of speakers. Additional speaker recognition measurements indicate that speaker identity can be recognized accurately from the gender-independent representation. Finally, a subjective experiment on voice gender manipulation shows that the proposed architecture can convert voice gender with very high efficiency and good naturalness.
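The adversarial disentanglement idea summarized in the abstract can be illustrated with a generic fader-network-style objective. The sketch below is not the paper's loss or architecture. It assumes a binary attribute (for example gender coded as 0/1), a discriminator that outputs the probability that the attribute equals 1 given the encoded representation, and a mean-squared reconstruction error. The function names `bce`, `mse`, and `fader_style_losses` and the weight `lam` are hypothetical, introduced only for illustration.

```python
import math


def bce(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def mse(xs, x_hats):
    """Mean squared error between a signal and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(xs, x_hats)) / len(xs)


def fader_style_losses(x, x_hat, p_attr, y, lam=1.0):
    """Return (discriminator_loss, autoencoder_loss) for one batch.

    x      : original features (flat list, stands in for speech features)
    x_hat  : reconstruction produced by the decoder from the code and y
    p_attr : discriminator's P(attribute = 1 | code), one value per example
    y      : true binary attribute per example
    lam    : weight of the adversarial term (assumed hyperparameter)
    """
    # The discriminator tries to predict the true attribute from the code.
    disc_loss = bce(p_attr, y)
    # The encoder/decoder reconstruct x while pushing the discriminator
    # toward the flipped label, which discourages attribute leakage.
    flipped = [1 - v for v in y]
    ae_loss = mse(x, x_hat) + lam * bce(p_attr, flipped)
    return disc_loss, ae_loss


# Toy batch: perfect reconstruction, discriminator at chance level.
x = [0.2, -0.1, 0.4, 0.0]
y = [1, 0, 1, 0]
d_chance, a_chance = fader_style_losses(x, x, [0.5] * 4, y)

# Same batch, but the discriminator now recovers the attribute confidently.
d_leaky, a_leaky = fader_style_losses(x, x, [0.99, 0.01, 0.99, 0.01], y)
```

In this toy setting, a code that leaks the attribute lowers the discriminator loss and raises the autoencoder loss, so minimizing the autoencoder loss pushes the encoder toward an attribute-independent code. At inference, the attribute value fed to the decoder can then be changed independently of the code.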
Pages: 16