Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

Cited by: 58
Authors
Polyak, Adam [1 ,2 ]
Adi, Yossi [1 ]
Copet, Jade [3 ]
Kharitonov, Eugene [3 ]
Lakhotia, Kushal [4 ]
Hsu, Wei-Ning [4 ]
Mohamed, Abdelrahman [4 ]
Dupoux, Emmanuel [3 ,5 ]
Affiliations
[1] Facebook AI Res, Tel Aviv, Israel
[2] Tel Aviv Univ, Tel Aviv, Israel
[3] Facebook AI Res, Paris, France
[4] Facebook AI Res, Menlo Park, CA, USA
[5] EHESS, Paris, France
Keywords
speech generation; speech resynthesis; self-supervised learning; speech codec
DOI
10.21437/Interspeech.2021-475
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline codes
100104; 100213
Abstract
We propose using self-supervised discrete representations for the task of speech resynthesis. To generate a disentangled representation, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows us to synthesize speech in a controllable manner. We analyze various state-of-the-art self-supervised representation learning methods and shed light on the advantages of each method with respect to reconstruction quality and disentanglement properties. Specifically, we evaluate F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), the intelligibility of the recordings, and overall quality using subjective human evaluation. Lastly, we demonstrate how these representations can be used for an ultra-lightweight speech codec: using the obtained representations, we reach a rate of 365 bits per second while providing better speech quality than the baseline methods. Audio samples can be found under the following link: resynthesis-ssl.github.io.
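The bitrate of such a discrete codec follows directly from each stream's frame rate and codebook size (bits per second = frames per second × bits per symbol). A minimal sketch of this arithmetic is below; the configuration values are illustrative assumptions for a content-unit stream plus a coarse quantized-F0 stream, not the paper's exact setup:

```python
import math

def stream_bitrate(frame_rate_hz: float, codebook_size: int) -> float:
    """Bits per second for one discrete stream:
    frame rate times bits needed to index the codebook."""
    return frame_rate_hz * math.log2(codebook_size)

# Hypothetical configuration for illustration only:
content_bps = stream_bitrate(50.0, 100)   # 50 Hz content units, 100-entry codebook
f0_bps      = stream_bitrate(6.25, 32)    # low-rate quantized F0 stream
speaker_bps = 0.0                         # one embedding per utterance, ~0 bps amortized

total_bps = content_bps + f0_bps + speaker_bps
print(f"{total_bps:.1f} bits per second")
```

With these assumed numbers the total lands in the few-hundred-bps range, which is why discretizing content, prosody, and speaker separately yields an ultra-lightweight codec.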
Pages: 3615-3619
Page count: 5