DISENTANGLED SPEECH EMBEDDINGS USING CROSS-MODAL SELF-SUPERVISION

Cited by: 0
Authors
Nagrani, Arsha [1 ]
Chung, Joon Son [1 ,2 ]
Albanie, Samuel [1 ]
Zisserman, Andrew [1 ]
Affiliations
[1] Univ Oxford, Dept Engn Sci, Visual Geometry Grp, Oxford, England
[2] Naver Corp, Seongnam Si, South Korea
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
speaker recognition; cross-modal learning; self-supervised machine learning;
DOI
10.1109/icassp40776.2020.9054057
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
The objective of this paper is to learn representations of speaker identity without access to manually annotated data. To do so, we develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video. The key idea behind our approach is to tease apart, without annotation, the representations of linguistic content and speaker identity. We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors, offering the potential for greater generalisation to novel combinations of content and identity, and ultimately producing speaker identity representations that are more robust. We train our method on a large-scale audio-visual dataset of talking heads 'in the wild', and demonstrate its efficacy by evaluating the learned speaker representations on standard speaker recognition benchmarks.
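The two-stream architecture described in the abstract — a shared low-level trunk feeding separate content and identity streams — can be sketched as a toy forward pass. This is a minimal illustration only: all layer sizes, weight initialisations, and variable names below are assumptions for the sketch, not the paper's actual configuration, and the training objective (cross-modal synchrony plus disentangling) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_linear(x, w, b):
    """Affine layer followed by ReLU."""
    return np.maximum(x @ w + b, 0.0)

# Hypothetical dimensions, chosen only for illustration.
D_IN, D_SHARED, D_CONTENT, D_ID = 40, 64, 32, 32

# Shared low-level trunk, common to both representations.
w_shared = rng.normal(0, 0.1, (D_IN, D_SHARED))
b_shared = np.zeros(D_SHARED)

# Two heads: one for linguistic content, one for speaker identity.
w_content = rng.normal(0, 0.1, (D_SHARED, D_CONTENT))
w_id = rng.normal(0, 0.1, (D_SHARED, D_ID))

def two_stream(x):
    """Map input features to separate content and identity embeddings."""
    h = relu_linear(x, w_shared, b_shared)   # shared low-level features
    content = h @ w_content                  # content stream
    identity = h @ w_id                      # identity stream
    # L2-normalise each stream so embeddings compare by cosine similarity.
    content = content / (np.linalg.norm(content, axis=-1, keepdims=True) + 1e-8)
    identity = identity / (np.linalg.norm(identity, axis=-1, keepdims=True) + 1e-8)
    return content, identity

x = rng.normal(size=(8, D_IN))               # a batch of 8 audio feature frames
c, s = two_stream(x)
print(c.shape, s.shape)                      # (8, 32) (8, 32)
```

In the paper's setting, the content stream would be supervised by audio-visual synchrony within a clip, while the identity stream would be supervised by matching faces and voices across clips; the explicit split into two heads is what allows the two factors to be disentangled.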
Pages: 6829-6833
Page count: 5