Towards Disentangled Speech Representations

Cited by: 1
Authors
Peyser, Cal [1, 2]
Huang, Ronny [2]
Rosenberg, Andrew [2]
Sainath, Tara N. [2]
Picheny, Michael [1]
Cho, Kyunghyun [1]
Affiliations
[1] NYU, Ctr Data Sci, New York, NY 10011 USA
[2] Google Inc, Mountain View, CA 94043 USA
DOI
10.21437/Interspeech.2022-30
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
The careful construction of audio representations has become a dominant feature in the design of approaches to many speech tasks. Increasingly, such approaches have emphasized "disentanglement", where a representation contains only parts of the speech signal relevant to transcription while discarding irrelevant information. In this paper, we construct a representation learning task based on joint modeling of ASR and TTS, and seek to learn a representation of audio that disentangles that part of the speech signal that is relevant to transcription from that part which is not. We present empirical evidence that successfully finding such a representation is tied to the randomness inherent in training. We then make the observation that these desired, disentangled solutions to the optimization problem possess unique statistical properties. Finally, we show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task. These observations motivate a novel approach to learning effective audio representations.
Pages: 3603 - 3607
Number of pages: 5
Related papers
50 in total
  • [31] A Commentary on the Unsupervised Learning of Disentangled Representations
    Locatello, Francesco
    Bauer, Stefan
    Lucic, Mario
    Raetsch, Gunnar
    Gelly, Sylvain
    Schoelkopf, Bernhard
    Bachem, Olivier
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 13681 - 13684
  • [32] On Learning Disentangled Representations for Gait Recognition
    Zhang, Ziyuan
    Tran, Luan
    Liu, Feng
    Liu, Xiaoming
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (01) : 345 - 360
  • [33] Better representations: Invariant, disentangled and reusable
    Montavon, Grégoire
    Müller, Klaus-Robert
    Springer Verlag, Heidelberg, Germany, 7700 (LECTURE NO) : 559 - 560
  • [34] Disentangled Representations via Synergy Minimization
    Steeg, Greg Ver
    Brekelmans, Rob
    Harutyunyan, Hrayr
    Galstyan, Aram
    2017 55TH ANNUAL ALLERTON CONFERENCE ON COMMUNICATION, CONTROL, AND COMPUTING (ALLERTON), 2017, : 180 - 187
  • [35] BlobGAN: Spatially Disentangled Scene Representations
    Epstein, Dave
    Park, Taesung
    Zhang, Richard
    Shechtman, Eli
    Efros, Alexei A.
    COMPUTER VISION - ECCV 2022, PT XV, 2022, 13675 : 616 - 635
  • [36] Image Generation and Translation with Disentangled Representations
    Hinz, Tobias
    Wermter, Stefan
    2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2018,
  • [37] Learning Disentangled Representations with the Wasserstein Autoencoder
    Gaujac, Benoit
    Feige, Ilya
    Barber, David
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2021: RESEARCH TRACK, PT III, 2021, 12977 : 69 - 84
  • [38] TOWARDS LEARNING NUISANCE-FREE REPRESENTATIONS OF SPEECH
    Liu, Lixing
    Ghosh, Sayan
    Scherer, Stefan
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 6817 - 6821
  • [39] Animating Face using Disentangled Audio Representations
    Mittal, Gaurav
    Wang, Baoyuan
    2020 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2020, : 3279 - 3287
  • [40] Learning Debiased and Disentangled Representations for Semantic Segmentation
    Chu, Sanghyeok
    Kim, Dongwan
    Han, Bohyung
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34