Audio-Visual Speaker Verification via Joint Cross-Attention

Cited: 0
Authors
Rajasekhar, Gnana Praveen [1 ]
Alam, Jahangir [1 ]
Affiliations
[1] Computer Research Institute of Montreal (CRIM), Montreal, QC H3N 1M3, Canada
Source
Keywords
Cross-attention; Audio-visual fusion; Speaker verification; Joint-attention
DOI
10.1007/978-3-031-48312-7_2
CLC Number
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Speaker verification has been widely explored using speech signals, with deep models yielding significant improvements. Recently, there has been a surge of interest in combining faces and voices, as the two modalities offer complementary and more comprehensive information than speech signals alone. Although existing methods that fuse faces and voices improve over either modality on its own, the potential of audio-visual fusion has not been fully explored for speaker verification: most existing approaches rely on score-level fusion or simple feature concatenation. In this work, we explore cross-modal joint attention to fully leverage both the inter-modal complementary information and the intra-modal information for speaker verification. Specifically, we estimate cross-attention weights from the correlation between the joint feature representation and the individual feature representations, which effectively captures both intra-modal and inter-modal relationships between faces and voices. We show that efficiently leveraging these intra- and inter-modal relationships significantly improves the performance of audio-visual fusion for speaker verification. The proposed approach is evaluated on the VoxCeleb1 dataset, where it significantly outperforms state-of-the-art audio-visual fusion methods for speaker verification.
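To make the fusion mechanism concrete, below is a minimal PyTorch sketch of joint cross-attentional fusion as described in the abstract. The module name JointCrossAttention, the linear projections w_ja, w_jv, w_a, w_v, and all dimensions are illustrative assumptions, not the authors' released implementation; the sketch only shows how computing attention from the correlation between each modality and the joint (concatenated) representation couples intra- and inter-modal information.

import torch
import torch.nn as nn

class JointCrossAttention(nn.Module):
    # Illustrative module (not the authors' code): attention weights for each
    # modality are computed from its correlation with the joint (concatenated)
    # audio-visual representation, so the re-weighted features reflect both
    # intra-modal and inter-modal relationships.
    def __init__(self, d):
        super().__init__()
        self.w_ja = nn.Linear(2 * d, d, bias=False)  # joint -> audio correlation space
        self.w_jv = nn.Linear(2 * d, d, bias=False)  # joint -> visual correlation space
        self.w_a = nn.Linear(d, d, bias=False)       # audio attention projection
        self.w_v = nn.Linear(d, d, bias=False)       # visual attention projection

    def forward(self, f_a, f_v):
        # f_a, f_v: (batch, seq, d) audio and visual feature sequences
        j = torch.cat([f_a, f_v], dim=-1)            # joint representation, (batch, seq, 2d)
        scale = f_a.size(-1) ** 0.5
        # correlation of each modality with the joint representation, (batch, seq, seq)
        c_a = torch.tanh(torch.bmm(self.w_ja(j), f_a.transpose(1, 2)) / scale)
        c_v = torch.tanh(torch.bmm(self.w_jv(j), f_v.transpose(1, 2)) / scale)
        # correlation maps re-weight the original per-modality features
        h_a = torch.relu(self.w_a(torch.bmm(c_a, f_a)))
        h_v = torch.relu(self.w_v(torch.bmm(c_v, f_v)))
        # residual combination keeps the unattended features in the mix
        return torch.cat([f_a + h_a, f_v + h_v], dim=-1)

# Example: fuse 10-frame audio/visual sequences with 512-dim features
fusion = JointCrossAttention(d=512)
emb = fusion(torch.randn(2, 10, 512), torch.randn(2, 10, 512))  # (2, 10, 1024)

Conditioning the correlation matrices on the joint representation, rather than on the other modality alone, is what lets a single attention map encode both within-modality and cross-modality structure.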
Pages: 18-31
Page count: 14
Related Papers (showing 10 of 50)
  • [1] Qian, Xinyuan; Wang, Zhengdong; Wang, Jiadong; Guan, Guohui; Li, Haizhou. Audio-Visual Cross-Attention Network for Robotic Speaker Tracking. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 550-562.
  • [2] Praveen, R. Gnana; de Melo, Wheidima Carneiro; Ullah, Nasib; Aslam, Haseeb; Zeeshan, Osama; Denorme, Theo; Pedersoli, Marco; Koerich, Alessandro L.; Bacon, Simon; Cardinal, Patrick; Granger, Eric. A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022: 2485-2494.
  • [3] Zhou, Haodong; Li, Tao; Wang, Jie; Li, Lin; Hong, Qingyang. CASA-Net: Cross-attention and Self-attention for End-to-End Audio-visual Speaker Diarization. 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2023: 102-106.
  • [4] Praveen, R. Gnana; Cardinal, Patrick; Granger, Eric. Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention. IEEE Transactions on Biometrics, Behavior, and Identity Science, 2023, 5(3): 360-373.
  • [5] Dean, David; Sridharan, Sridha. Dynamic visual features for audio-visual speaker verification. Computer Speech and Language, 2010, 24(2): 136-149.
  • [6] Zhang, Jiwei; Yu, Yi; Tang, Suhua; Li, Wei; Wu, Jianming. Multi-scale network with shared cross-attention for audio-visual correlation learning. Neural Computing & Applications, 2023, 35(27): 20173-20187.
  • [7] Sari, Leda; Singh, Kritika; Zhou, Jiatong; Torresani, Lorenzo; Singhal, Nayan; Saraf, Yatharth. A Multi-View Approach to Audio-Visual Speaker Verification. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021: 6194-6198.
  • [8] Gebru, Israel D.; Alameda-Pineda, Xavier; Horaud, Radu; Forbes, Florence. Audio-Visual Speaker Localization via Weighted Clustering. 2014 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2014.
  • [9] Liu, Xin; Li, Heyang; Zhong, Bineng; Du, Jixiang. Efficient Audio-visual Cross-modal Speaker Tagging via Supervised Joint Correspondence Auto-encoder. Journal of Electronics & Information Technology, 2018, 40(7): 1635-1642.
  • [10] Jing, Xuebin; He, Liang; Song, Zhida; Wang, Shaolei. Audio-Visual Fusion Based on Interactive Attention for Person Verification. Sensors, 2023, 23(24).