Audio-Visual Speaker Verification via Joint Cross-Attention

Cited by: 0

Authors:

Rajasekhar, Gnana Praveen [1]

Alam, Jahangir [1]

Affiliations:

[1] Comp Res Inst Montreal, Montreal, PQ H3N 1M3, Canada

Source:

SPEECH AND COMPUTER, SPECOM 2023, PT II | 2023 / Vol. 14339

Keywords:

Cross-attention; Audio-visual fusion; Speaker verification; Joint-attention;

DOI:

10.1007/978-3-031-48312-7_2

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Speaker verification has been widely explored using speech signals, and has shown significant improvement with deep models. Recently, there has been a surge in exploring faces and voices together, as they can offer more complementary and comprehensive information than the speech modality alone. Though current methods for fusing faces and voices have shown improvement over the individual face or voice modalities, the potential of audio-visual fusion is not fully explored for speaker verification. Most existing audio-visual fusion methods rely on either score-level fusion or simple feature concatenation. In this work, we explore cross-modal joint attention to fully leverage the inter-modal complementary information and the intra-modal information for speaker verification. Specifically, we estimate the cross-attention weights based on the correlation between the joint feature representation and the individual feature representations in order to effectively capture both intra-modal and inter-modal relationships among the faces and voices. We show that efficiently leveraging the intra- and inter-modal relationships significantly improves the performance of audio-visual fusion for speaker verification. The proposed approach is evaluated on the VoxCeleb1 dataset. Results show that it can significantly outperform state-of-the-art audio-visual fusion methods for speaker verification.
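
The idea stated in the abstract, attention weights derived from the correlation between a joint audio-visual representation and each individual modality, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the softmax normalization, the residual connection, the time-major feature shapes, and the names `joint_cross_attention`, `w_ja`, and `w_jv` are all assumptions made for illustration, and the random inputs stand in for real face and voice embeddings.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def joint_cross_attention(x_a, x_v, w_ja, w_jv):
    """Hypothetical joint cross-attention over T time steps.

    x_a: (T, d_a) audio features, x_v: (T, d_v) visual features.
    w_ja: (d_a, d_a + d_v) and w_jv: (d_v, d_a + d_v) are assumed
    learnable projections (random here, for illustration only).
    """
    # Joint representation: concatenate both modalities per time step.
    j = np.concatenate([x_a, x_v], axis=1)           # (T, d)
    d = j.shape[1]
    # Correlation between each modality and the joint representation.
    corr_a = (x_a @ w_ja) @ j.T / np.sqrt(d)         # (T, T)
    corr_v = (x_v @ w_jv) @ j.T / np.sqrt(d)         # (T, T)
    # Assumed normalization: softmax over time steps.
    attn_a = softmax(corr_a)
    attn_v = softmax(corr_v)
    # Attended features with an assumed residual connection.
    out_a = x_a + attn_a @ x_a                       # (T, d_a)
    out_v = x_v + attn_v @ x_v                       # (T, d_v)
    return out_a, out_v, attn_a, attn_v

rng = np.random.default_rng(0)
T, d_a, d_v = 5, 8, 6
x_a = rng.standard_normal((T, d_a))
x_v = rng.standard_normal((T, d_v))
w_ja = rng.standard_normal((d_a, d_a + d_v)) * 0.1
w_jv = rng.standard_normal((d_v, d_a + d_v)) * 0.1

out_a, out_v, attn_a, attn_v = joint_cross_attention(x_a, x_v, w_ja, w_jv)
# Fused audio-visual representation passed on to a verification back end.
fused = np.concatenate([out_a, out_v], axis=1)
```

Because each correlation matrix is computed against the joint representation, the weights for one modality depend on both its own features and those of the other modality, which is how a single attention step can reflect intra-modal and inter-modal relationships at once.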

Pages: 18-31

Page count: 14
