DISENTANGLED SPEECH EMBEDDINGS USING CROSS-MODAL SELF-SUPERVISION

Cited by: 0

Authors:

Nagrani, Arsha [1]

Chung, Joon Son [1,2]

Albanie, Samuel [1]

Zisserman, Andrew [1]

Affiliations:

[1] Univ Oxford, Dept Engn Sci, Visual Geometry Grp, Oxford, England

[2] Naver Corp, Seongnam Si, South Korea

Source:

2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020

Funding:

UK Engineering and Physical Sciences Research Council (EPSRC);

Keywords:

speaker recognition; cross-modal learning; self-supervised machine learning;

DOI:

10.1109/icassp40776.2020.9054057

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

The objective of this paper is to learn representations of speaker identity without access to manually annotated data. To do so, we develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video. The key idea behind our approach is to tease apart, without annotation, the representations of linguistic content and speaker identity. We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors, offering the potential for greater generalisation to novel combinations of content and identity and ultimately producing speaker identity representations that are more robust. We train our method on a large-scale audio-visual dataset of talking heads 'in the wild', and demonstrate its efficacy by evaluating the learned speaker representations for standard speaker recognition performance.
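
The two-stream idea in the abstract (shared low-level features feeding separate content and identity representations) can be pictured with a small toy sketch. This is not the authors' implementation: the class name `TwoStreamAudioEncoder`, the layer sizes, the random initialisation, and the use of plain linear layers are all hypothetical choices made only for illustration, and no training objective is shown.

```python
import math
import random


def linear(x, weights, bias):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise rectifier."""
    return [max(0.0, v) for v in x]


def l2_normalise(x):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / norm for v in x]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero bias (hypothetical initialisation)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class TwoStreamAudioEncoder:
    """Toy stand-in: one shared trunk followed by two separate embedding heads."""

    def __init__(self, n_in=8, n_shared=6, n_embed=4, seed=0):
        rng = random.Random(seed)
        # Shared low-level features used by both streams.
        self.trunk = init_layer(rng, n_shared, n_in)
        # Stream intended to capture linguistic content.
        self.content_head = init_layer(rng, n_embed, n_shared)
        # Stream intended to capture speaker identity.
        self.identity_head = init_layer(rng, n_embed, n_shared)

    def forward(self, audio_features):
        shared = relu(linear(audio_features, *self.trunk))
        content = l2_normalise(linear(shared, *self.content_head))
        identity = l2_normalise(linear(shared, *self.identity_head))
        return content, identity


# Usage: one hypothetical 8-dimensional audio feature vector.
encoder = TwoStreamAudioEncoder()
rng = random.Random(1)
frame = [rng.uniform(-1.0, 1.0) for _ in range(8)]
content_emb, identity_emb = encoder.forward(frame)
print(len(content_emb), len(identity_emb))
```

In this sketch both embeddings are computed from the same shared features, so any separation between content and identity would have to come from the training signal, which the abstract attributes to cross-modal synchrony between faces and audio.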

Citation

Pages: 6829-6833

Page count: 5
