Audio-Visual Speaker Verification via Joint Cross-Attention

Cited by: 0

Authors:

Rajasekhar, Gnana Praveen [1]

Alam, Jahangir [1]

Affiliations:

[1] Comp Res Inst Montreal, Montreal, PQ H3N 1M3, Canada

Source:

SPEECH AND COMPUTER, SPECOM 2023, PT II | 2023 / Vol. 14339

Keywords:

Cross-attention; Audio-visual fusion; Speaker verification; Joint-attention

DOI:

10.1007/978-3-031-48312-7_2

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Subject classification codes:

070206; 082403

Abstract:

Speaker verification has been widely explored using speech signals and has shown significant improvement with deep models. Recently, there has been a surge in exploring faces and voices together, as they can offer more complementary and comprehensive information than the speech modality alone. Though current methods in the literature on the fusion of faces and voices have shown improvement over the individual face or voice modalities, the potential of audio-visual fusion is not fully explored for speaker verification. Most of the existing methods based on audio-visual fusion rely on either score-level fusion or simple feature concatenation. In this work, we have explored cross-modal joint attention to fully leverage the inter-modal complementary information and the intra-modal information for speaker verification. Specifically, we estimate the cross-attention weights based on the correlation between the joint feature representation and the individual feature representations in order to effectively capture both intra-modal and inter-modal relationships among the faces and voices. We have shown that efficiently leveraging the intra- and inter-modal relationships significantly improves the performance of audio-visual fusion for speaker verification. The performance of the proposed approach has been evaluated on the VoxCeleb1 dataset. Results show that the proposed approach can significantly outperform the state-of-the-art methods of audio-visual fusion for speaker verification.
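The abstract describes attention weights derived from the correlation between a joint audio-visual representation and each individual modality. The following is only a minimal NumPy sketch of that general idea, not the authors' implementation: the function names, the projection matrices `w_ja` and `w_jv`, the tanh and softmax normalisation, the residual connection, and the mean pooling are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def joint_cross_attention(x_a, x_v, w_ja, w_jv):
    """Illustrative joint cross-attention (hypothetical formulation).

    x_a:  (d_a, n) audio features over n time steps
    x_v:  (d_v, n) visual features over the same n time steps
    w_ja: (d_a, d_a + d_v) assumed learnable projection (audio vs. joint)
    w_jv: (d_v, d_a + d_v) assumed learnable projection (visual vs. joint)
    """
    d = x_a.shape[0] + x_v.shape[0]

    # Joint representation: stack both modalities along the feature axis.
    joint = np.concatenate([x_a, x_v], axis=0)            # (d, n)

    # Correlation of each modality with the joint representation.
    c_a = np.tanh(x_a.T @ w_ja @ joint / np.sqrt(d))      # (n, n)
    c_v = np.tanh(x_v.T @ w_jv @ joint / np.sqrt(d))      # (n, n)

    # Column-normalised weights re-mix each modality over time;
    # the residual term keeps the original features.
    att_a = x_a + x_a @ softmax(c_a, axis=0)              # (d_a, n)
    att_v = x_v + x_v @ softmax(c_v, axis=0)              # (d_v, n)

    # Fused audio-visual representation.
    return np.concatenate([att_a, att_v], axis=0)         # (d, n)


# Toy usage with random features and random (untrained) projections.
rng = np.random.default_rng(0)
x_a = rng.standard_normal((6, 4))
x_v = rng.standard_normal((5, 4))
w_ja = rng.standard_normal((6, 11)) * 0.1
w_jv = rng.standard_normal((5, 11)) * 0.1

fused = joint_cross_attention(x_a, x_v, w_ja, w_jv)
embedding = fused.mean(axis=1)  # assumed temporal pooling to one vector
print(fused.shape, embedding.shape)
```

Because the joint representation contains both modalities, each correlation matrix mixes intra-modal and inter-modal terms, which is the property the abstract emphasises. In a trained system the projections would be learned rather than random.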


Pages: 18-31

Page count: 14
